Preface



Physical systems which right themselves after being disturbed evoke our curiosity 
because we want to understand how such systems are able to react to unexpected 
stimuli. The mechanisms are all the more fascinating when systems are composed 
of small, simple units, and the ability of the system to self-stabilize emerges 
out of its components. Faithful computer simulations of such physical systems 
exhibit the self-stabilizing property, but in the realm of computing, particularly 
for distributed systems, we have greater ambition. We imagine that all manner of 
software, ranging from basic communication protocols to high-level applications, 
could enjoy self-corrective properties. 

Self-stabilizing software offers a unique, non-traditional approach to the cru- 
cial problem of transient fault tolerance. Many successful instances of modern 
fault-tolerant networks are based on principles of self-stabilization. Surprisingly, 
the most widely accepted technical definition of a self-stabilizing system does 
not refer to faults: it is the property that the system can be started in any ini- 
tial state, possibly an “illegal state,” and yet the system guarantees to behave 
properly in finite time. This, and similar definitions, break many traditional 
approaches to program design, in which the programmer by habit makes as- 
sumptions about initial conditions. The composition of self-stabilizing systems, 
initially seen as a daunting challenge, has been transformed into a manage- 
able task, thanks to an accumulation of discoveries by many investigators. Re- 
search on various topics in self-stabilization continues to supply new methods for 
constructing self-stabilizing systems, determines limits and applicability of the 
paradigm of self-stabilization, and connects self-stabilization to related areas of 
fault tolerance and distributed computing. 

The Workshop on Self-Stabilizing Systems (WSS) is the main forum for re- 
search in the area of self-stabilization. The first workshop was held in Austin 
(1989), and since 1995, workshops have been held biennially: Las Vegas (1995), 
Santa Barbara (1997), Austin (1999), and Lisbon (2001). WSS 2001 was thus 
our first workshop held outside North America, and reflected the strong growth 
and international participation in the area. We received 27 submitted papers 
for this workshop, which is a 50% increase from the previous workshops. The 
program committee selected 14 of the submitted papers, and Sukumar Ghosh 
presented our invited contribution. 

This volume covers many areas within the field and reflects current trends 
and new directions in self-stabilization. Important applications of distributed 
computing are topics in several papers (routing, group membership, publish- 
subscribe systems). Other papers strike a methodological tone, describing tools 
to construct self-stabilizing systems. Three papers have “agent” in their titles, 
which is a topic not mentioned in any previous workshop. Several papers in- 
vestigate non-standard definitions (or weakenings) of self-stabilization. Our field 
continues to grow and evolve. 
Cooperating Mobile Agents and Stabilization



Sukumar Ghosh* 

The University of Iowa 
ghosh@cs.uiowa.edu



Abstract. In the execution of distributed algorithms on a network of 
processes, the actions of the individual processes are scheduled by their 
local schedulers or demons. The schedulers communicate with their im- 
mediate neighbors using shared registers or message passing. This paper 
examines an alternative approach to the design of distributed algorithms, 
where mobile agents are allowed to traverse a network, extract state in- 
formation, and make appropriate modification of the local states to steer 
the system towards a global goal. The primary emphasis of this paper 
is system stabilization. Both single-agent and multi-agent protocols are 
examined, and the advantages and disadvantages of agent-based stabi- 
lization are discussed. 



1 Introduction 

Consider the execution of a program on a network of processes. The processes 
communicate with one another through shared memory or message passing. Each 
process has a scheduler (also called a demon) that collects information about the 
states of its neighboring processes, and schedules its local actions. Two of
the well-known execution models rely on (i) central demon and (ii) distributed 
demons. In the central demon model, processes execute their actions serially, 
whereas in the distributed demon model, any subset of the set of enabled pro- 
cesses can execute their actions concurrently. 

This paper explores an alternative model for computation that uses mobile 
agents in the context of stabilization. A mobile agent m is a program that can 
migrate from one node to another, perform various types of operations at these 
nodes, and take autonomous routing decisions. In contrast with messages that 
are passive, an agent is an active entity, that can be compared with a messenger. 

Conventional stabilizing distributed systems expect processes to run prede- 
fined programs that have been carefully designed to guarantee recovery from all 
possible bad configurations. However, in an open environment like the Internet 
where processes are not under the control of a single administration, expecting 
every process to modify its program for the sake of stabilization is unrealistic, 
and a more centralized mechanism for network administration is a viable
alternative. This paper explores such an alternative mechanism for stabilizing a
distributed system, in which processes are capable of accommodating visiting 
agents. 

* This research was supported in part by the National Science Foundation under grant 
CCR-9901391. 



A.K. Datta and T. Herman (Eds.): WSS 2001, LNCS 2194, pp. 1-21, 2001.
(c) Springer-Verlag Berlin Heidelberg 2001






We assume that the underlying system to be stabilized uses message passing 
for interprocess communication. Note that we could as well consider interprocess 
communication through shared memory, but the choice has been made only for 
the sake of uniformity - agents propagate across links exactly like messages. The 
state of the network consists of the states of each of the processes, as well as 
those of the channels. Thus, an action by a process to update its own state, or to
send or receive messages also modifies the state of the network. The set of states 
of the network can be classified into the categories legal and illegal. A stabilizing 
system guarantees that regardless of the starting state, the network eventually 
reaches a legal state, and remains in the legal state thereafter. Convergence and 
closure are the two cornerstones of stabilizing systems 0. 

We further assume that in addition to the ongoing activities in the above 
network of processes, one or more processes can send out mobile agents that can 
migrate from one process to another, read the local states of the visited processes, 
and update these local states whenever necessary. While the individual processes 
maintain the closure of legal configurations, agents take up the responsibility of 
steering the system to a legal configuration when they detect a configuration to 
be illegal. The detection involves taking a total or partial snapshot of the system 
state. Corrective actions involve the modification of the states of one or more 
processes visited by the agents. 

In the past, Kutten, Korach, and Moran H2] used tokens with identities to 
solve the leader election problem. Each process initiates a network traversal with 
a token labeled with its identity. In the paper Distributed Reset 0, processes 
first elect a leader and then construct a spanning tree with the leader as the 
root. Subsequently, the sending of reset waves can be viewed as the leader
sending reset agents down the network. In the area of electronic
commerce, the use of mobile agents has been steadily increasing. In network 
management, primitive agents mimicking biological entities like ants (see the 
article on Swarms ini) traversing a network have been used in solving various 
problems like shortest-path computation and congestion control. In such sys- 
tems, the individual agents do not have explicit problem-solving knowledge, but 
intelligent action emerges from the collective action by ants. These papers pro- 
vide the general motivation behind exploring how mobile agents can be utilized 
to stabilize distributed systems. In our systems, an agent is a sequential pro- 
gram with no obvious limit on its intelligence - problems can be solved either 
by a single agent, or by a group of agents. Furthermore, there is the additional 
challenge that the initial system configuration can be arbitrary, and the agents 
themselves may be corrupted in transit. 

This paper is about using agents as a tool for stabilization, and not about a 
formal computational model using agents, which appears in 0. The paper has 
six sections. Section 2 provides a general description of the agent model. Section 
3 illustrates the construction of a spanning tree using a single reliable agent. 
Section 4 explains how to implement a reliable agent. Section 5 presents two dif- 
ferent solutions to the spanning tree construction problem using multiple agents. 
Finally, Section 6 contains some concluding remarks about agent-stabilizing
systems.
2 The Agent Model

An agent consists of the following six components^1:

1. The identifier id, usually the same as the initiator’s id. The id is unnecessary 
if there is a single agent, but is essential to distinguish between multiple 
agents in the same system. 

2. The agent program A 

3. The briefcase B containing a set of variables 

4. The previous process PRE visited by the agent 

5. The next process to visit NEXT that is computed after every hop 

6. A supervisory program S for bookkeeping purposes. 
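
The six components listed above can be pictured as a simple record. The following is a minimal Python sketch, not part of the original protocol: the class name `Agent`, the field names, and the `hop` helper are illustrative assumptions, and the supervisory program S is reduced here to a hop counter.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional


@dataclass
class Agent:
    """Illustrative container for the six agent components (names are assumptions)."""
    id: int                                   # 1. identifier, usually the initiator's id
    program: Callable[["Agent", int], None]   # 2. agent program A, run at each visited process
    briefcase: Dict[str, Any] = field(default_factory=dict)  # 3. briefcase B
    pre: Optional[int] = None                 # 4. PRE: the previous process visited
    next: Optional[int] = None                # 5. NEXT: recomputed after every hop
    hops: int = 0                             # 6. bookkeeping for the supervisory program S

    def hop(self, current: int) -> None:
        """One atomic visit: run A at `current`, then record the move."""
        self.program(self, current)   # A is expected to set self.next
        self.pre = current
        self.hops += 1
```

In this sketch the agent program receives the agent itself, so that it can read and update the briefcase and set NEXT; how the program reaches the local variables of the visited process is left open.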

An agent may be installed into a system from outside, or it may be installed 
from within the system. Unless stated otherwise, we will consider the agent to 
be externally installed into a designated node called the home or the initiator. 
In a given system, the number of agents can vary, but for our purpose, we will 
need at least one agent. When the home of the agent is internally designated, it 
is done through an initial leader election by the component processes. 

When multiple agents are required, we can use either a static model, or a 
dynamic model. In the static model, either a single home process sends out a 
fixed number k (k > 1) of agents, or k distinct home processes^2 (each authorized
to send out agents) are identified before the computation starts. In the dynamic 
model, an agent is entitled to create an unspecified number of new agents, and 
destroy them as and when necessary. 

A hop by an agent is an atomic step used to move from one process to a 
neighbor. Each hop of the agent costs one message, and is completed in unit 
time. We assume that an agent performs the local computation at any process 
in zero time. The communication links are half-duplex; as a result, two agents
traveling in opposite directions along a path consisting of a chain of processes 
are guaranteed to meet each other at some process. The message complexity is 
determined by the total number of hops taken by all the agents to restore the 
system to a legal configuration. The time complexity is determined by the total 
number of time units needed to return to a legal configuration. In addition, if 
the agents are internally installed, then to calculate the overall time or message 
complexity, the overheads of leader election have to be taken into account. 

By choosing the agent model where the progress of computation and the 
restoration of legitimacy are controlled by a handful of mobile agents instead of 
an army of demons, we are clearly moving towards centralization of authority. 
We present two motivations behind such a decision. 



^1 For convenience, we will use capital letters to represent agent-related variables or
programs.

^2 We somehow need to distinguish k processes from the rest, and designate them as
initiators.
Motivation 1. At the risk of sounding trivial, consider a stabilizing solution
to the problem of maxima finding on a ring of n processes using a single agent.^3
For each process i, designate one successor neighbor(i) to which the agent can
move to make a trip round the ring. The goal is to set the local variable max
of every process to the largest id in the ring. These ids are positive integers.
The home process initializes the briefcase variables MAX (which records the
largest id) and MODE (∈ {0,1,2}) to 0. The protocol is presented in
Fig. 1.



Program for the agent while visiting process i

agent variables MODE, MAX;
process variables max;

if    MAX < id ∧ MODE < 2 → MAX := id;
□     MODE = 2 → max(i) := MAX;
fi;
NEXT := neighbor


Program for the home process

{Executed whenever the agent reaches home}
initially MAX = 0, MODE = 0

if    MODE < 2 → MODE := MODE + 1
□     MODE = 2 → MAX := 0; MODE := 0
fi;
NEXT := neighbor



Fig. 1. A stabilizing solution for maxima finding using a single agent. 

It takes at most two round-trip traversals for the agent to set MAX
correctly to the largest id, and one more traversal to write this value into the
individual processes. Thus the message complexity is 3(n - 1). Note that to solve
the same problem on the traditional message passing or shared memory models,
we will need O(n^2) steps. This is not an isolated example; similar observations
can be made about many other solutions too.
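
The behaviour of the protocol of Fig. 1 can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: process 0 plays the role of the home, the agent program also runs at the home like at any other process, the home program runs each time the agent completes a round trip, and a guarded command with no enabled guard is treated as a no-op. The function names and the dictionary used for the briefcase are hypothetical.

```python
def agent_program(i, ids, maxima, bc):
    """Agent program of Fig. 1, run while the agent visits process i."""
    if bc["MAX"] < ids[i] and bc["MODE"] < 2:
        bc["MAX"] = ids[i]
    elif bc["MODE"] == 2:
        maxima[i] = bc["MAX"]


def home_program(bc):
    """Home program of Fig. 1, run whenever the agent returns home."""
    if bc["MODE"] < 2:
        bc["MODE"] += 1
    elif bc["MODE"] == 2:
        bc["MAX"], bc["MODE"] = 0, 0


def simulate(ids, round_trips, briefcase=None, maxima=None):
    """Run the agent around the ring; process 0 is assumed to be the home."""
    bc = dict(briefcase) if briefcase is not None else {"MAX": 0, "MODE": 0}
    maxima = list(maxima) if maxima is not None else [0] * len(ids)
    for _ in range(round_trips):
        for i in range(len(ids)):      # visit the home first, then its successors
            agent_program(i, ids, maxima, bc)
        home_program(bc)               # the agent is back at its home
    return maxima
```

Starting from the briefcase that the home initializes, three round trips are enough in this sketch to write the largest id everywhere. A briefcase whose MAX is corrupted (with MODE still inside {0,1,2}) may first spread a wrong value, which is then overwritten after the next reset at the home.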



Motivation 2. Solutions using a single agent model can often be derived from
existing sequential algorithms through a simple adaptation mechanism. Sequential
graph algorithms, for example, specify at each step which node will execute
the next action. The adaptation by the corresponding agent model involves making
the agent move to the position of the next action (using a traversal algorithm)
prior to executing the action. The choice of a suitable sequential algorithm
may depend, among other things, on the extent of movement of the agent.

^3 The leader process need not be the same as the process with maximum id. It can, for
example, be a process with the smallest id, or may satisfy any other unique criterion.

Our stabilizing protocols work under the following two constraints: 



Non-interference. The normal operation of the system neither depends on, 
nor is influenced by the presence of agents. 

As an exception, only the home process of an agent can test the arrival of 
that agent (as a part of evaluating its guards), initialize or update the agent’s 
briefcase, and send out the agent to begin its next traversal (as a part of its 
action). Such an agent is sent out at convenient intervals to initiate a “clean-up
phase”; otherwise the operation of the system continues as usual. The individual
processes are oblivious to the presence of the agent. We disregard any minor
slowdown in the execution speed of a process due to the sharing of the resources
by visiting agents.



Atomicity. At any node, the arrival of an agent triggers the agent program 
whose execution is atomic. The agent program ends with the departure of the 
agent from that node, or with a waiting phase (in case the agent has to wait at 
that node for another agent to arrive), after which the execution of the applica- 
tion program at that node resumes. 

The computation at a node alternates between the agent program and the 
application program. At any node, the visit of a single agent can be represented 
by the following sequence of events. We denote an atomic event using ( ): 

agent arrives, ( agent program executed ), agent leaves 

When two agents I and J have to meet at a node k to exchange data, the
sequence of events will be as follows:

agent I arrives, ( agent program of I executed ), agent I waits at k

Following this, the application program at node k resumes, and continues until
the other agent J arrives. When J arrives at node k, the following sequence of
events takes place:

agent J arrives, ( data exchange with I occurs ), agents I and J leave.

Then the application program at k resumes once again. 

When the agent is not externally installed, it is possible to implement the 
agent model using shared memory or message passing. An outline is as follows: 
Let the processes elect a leader and designate it as the home of the agent. Put 
a copy of the agent program A (and the supervisory program S whose role will 
be explained later) at every process in the network, including the version for 
the home process at the leader. Also, put a copy of the briefcase variables B of 
the agent at the home process. Now, let the leader execute the agent program 
A, and send the updated briefcase variables of the agent program as messages 
to the next process whose identity is determined by the execution of A. After 
receiving this message, the recipient will execute A, and repeat the same steps
as its predecessor. This concludes the outline. 

Agent-based stabilization can be viewed as a stabilizing extension of a dis- 
tributed system as proposed in ^31 ■ While [El emphasized the feasibility of de- 
signing stabilizing distributed systems, we argue that mobile agents have some 
interesting properties that make implementations straightforward. 



3 Stabilization Using a Single Agent 

3.1 Spanning Tree Construction 

Here we first present a stabilizing protocol for the construction of a DFS span- 
ning tree using a single agent. The solution is taken from In the subsequent 
sections, we will build on this to present our multi-agent protocols. Let g = (v, e)
represent the topology of the undirected graph, where v is the set of nodes, and
e is the set of edges. In the single-agent protocol, the home of the agent is the
root of the spanning tree. We use p(i) to designate the parent of a node i. The
additional variables are as follows:

child(i) = {j : p(j) = i}
neighbor(i) = {j : (i, j) ∈ e}
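
As a small illustration of these two definitions, the sets can be computed directly from a parent map and an edge set. This is a sketch only: the representation of p as a dictionary and of e as a set of unordered pairs stored as tuples is an assumption made for the example.

```python
def child(i, p):
    """child(i) = {j : p(j) = i}, with p given as a dict node -> parent."""
    return {j for j, parent in p.items() if parent == i}


def neighbor(i, e):
    """neighbor(i) = {j : (i, j) in e}, with e a set of undirected edges."""
    return {b if a == i else a for (a, b) in e if i in (a, b)}
```

For a triangle with edges (0,1), (1,2), (0,2) and the parent map 1 -> 0, 2 -> 1, node 0 has the single child 1 but the two neighbors 1 and 2, which is the situation the traversal rules have to handle.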

The program of the agent consists of three types of actions: (i) actions that up- 
date the local variables of the process that it is visiting, (ii) actions that modify 
its briefcase variables, and (iii) actions that determine the next process that it 
will visit. The individual processes are passive. 

A key issue in agent-based solutions is graph traversal. To distinguish between
consecutive rounds of traversal, we introduce a briefcase variable SEQ (∈ {0,1})
that keeps track of the most recent round of traversal. With every process i,
define a boolean f(i) that is set to the value of SEQ whenever the process is
visited by the agent. SEQ is complemented by the root before the next traversal
begins. Thus, the condition f(i) ≠ SEQ is meant to represent that the node has
not been visited in the present round.

The agent program has three basic rules: DFS1, DFS2, DFS3, and is
described^4 in Fig. 2. A proof appears in jO]. This solution disregards the case when
an agent is trapped in a cyclic path and fails to return to the root. We will 
address this issue in the next section that deals with agent failures. 

Both the time complexity and the message complexity for stabilization are 
O(n^). Once stabilized, the agent needs 2(n — 1) hops for subsequent traversals. 

4 Agent Failure 

An agent, like any other process, is subject to failure or corruption. Since the 
agent is the main focus of control, the failure of an agent is a matter of major 
concern. 



^4 It disregards the details of how a process i maintains its child(i) and their flags.
Program for the agent while visiting node i

agent variables NEXT, PRE, SEQ;
process variables f, child, p;

{If the node is already visited, then retreat}
if f(i) = SEQ ∧ PRE ∉ child(i) → NEXT := PRE fi

if f(i) ≠ SEQ →
      {Mark the current node as visited and set the parent}
      f(i) := SEQ; p(i) := PRE;
fi

if    {Visit an unvisited child}
      (DFS1) ∃ j ∈ child(i): f(j) ≠ SEQ → NEXT := j
      {When all neighbors have been visited, return to the parent}
□     (DFS2) ∀ j ∈ neighbor(i): f(j) = SEQ → NEXT := p(i)
      {Create a path to a node that is unreachable using DFS1}
□     (DFS3) ∀ j ∈ child(i): f(j) = SEQ ∧ ∃ k ∈ neighbor(i): f(k) ≠ SEQ → NEXT := k
fi


Program for the home process

{Executed when the agent visits home}

if    ∃ j ∈ child: f(j) ≠ SEQ → NEXT := j
□     ∀ j ∈ neighbor: f(j) = SEQ → SEQ := 1 - SEQ;
                                    NEXT := k : k ∈ child
fi



Fig. 2. Spanning tree construction with a single agent. 



In the case of externally installed agents, we cannot take the help of the in- 
dividual processes to detect or correct a faulty agent, since no process is aware 
of the presence of the agent. The agent has to either heal itself, or kill itself after 
it detects that it is faulty. In the latter case, the home process will sense the loss 
of the agent using a timeout, and regenerate a new agent. 

Following traditions, we rule out the corruption of the agent program A or 
the supervisory program S, but take into account possible corruption of the 
agent variables, and the impact of it on the entire stabilization mechanism. We 
will see later that even the agent program A may be allowed to be corrupted as 
long as the program of the home process remains good. 

The agent traverses the network, and periodically visits its home. The home 
process appropriately updates the briefcase variables before the next traversal 
begins. To deal with agent failure, we first introduce a reliable agent. Divide the 
agent variables into two classes: privileged and non-privileged. Call an agent vari- 
able privileged, when it can be modified only by its home process - all other vari- 
ables will be called non-privileged. Examples of privileged variables are: MODE 
in Fig. 1, SEQ in Fig. 2, or the id assigned to an agent by its home process.
Then, an agent will be called reliable, when it satisfies the following two criteria: 

1. The agent completes its traversal of the network and returns home within a 
finite number of steps. 

2. The values of all the privileged variables of the agent remain unchanged during 
the traversal. 

An agent can be unreliable either due to the corruption of its privileged vari- 
ables during a traversal, or due to routing problems. Note that, by simply being 
reliable, an agent cannot stabilize a distributed system, but it leads to the adop- 
tion of a two-phased approach. In the first phase, we demonstrate how a reliable 
agent guarantees convergence and closure. In the second phase, we present meth- 
ods by which unreliable agents eventually become reliable, and remain reliable 
thereafter. This part will use some generic remedies, independent of the problem 
under consideration. The generic remedies are as follows: 



Loss of Agent. If the agent is killed, then the initiator discovers this using a
timeout^5 and generates a new agent with a new sequence number. If the timeout
is due to a delayed arrival of the original agent, then the original agent has
to be killed by the leader.

To avoid the risk of multiple agents with identical sequence numbers roam- 
ing in the system, the probabilistic technique of El can be used. It involves 
the use of a sequence number from a three-valued set E = {0, 1, 2}. If the se- 
quence number of the incoming agent matches with the sequence number of the 
previous outgoing agent, then the initiator randomly chooses the next sequence 
number from E, otherwise the agent is killed. Another approach to guarantee 
the uniqueness of the agent is to use counter-flushing m that shows how a 
single-token configuration can be restored on a ring in time 3 • R, where R is the 
roundtrip traversal time of the agent. 
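
The sequence-number check performed by the initiator can be sketched as follows. This is a hedged illustration of the rule stated above, not the cited technique in full: the function name, the use of `None` to signal that the agent is killed, and the injectable random source are assumptions.

```python
import random

SEQ_VALUES = (0, 1, 2)  # the three-valued set of sequence numbers


def on_agent_arrival(incoming_seq, last_sent_seq, rng=random):
    """Return the sequence number for the next traversal, or None to kill the agent."""
    if incoming_seq == last_sent_seq:
        # The arriving agent is the one most recently sent out:
        # pick the next sequence number at random from the set.
        return rng.choice(SEQ_VALUES)
    # Mismatch: a stale or duplicate agent, which the initiator kills.
    return None
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible when experimenting with the rule.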



Corruption of the Agent Identifier. An agent is recognized by its home 
using the agent’s id. If the id of the agent is corrupted, then the home process 
will not be able to recognize it, and the unreliable agent will roam the network 
forever. To prevent this, the supervisory program S of the agent counts the num- 
ber of hops taken by the agent. As soon as this number exceeds a predefined 
limit c·R (where c is a large constant), the agent kills itself. The same strategy
works if, due to routing anomalies, the agent is unable to return home.
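
The hop-count guard kept by the supervisory program S amounts to a single comparison per hop. A minimal sketch, assuming that R is measured in hops and that the counter is reset by the home process (the function name and the default value of c are hypothetical):

```python
def supervise_hop(hops_so_far, round_trip, c=10):
    """Count one more hop; the agent stays alive only while hops <= c * R."""
    hops = hops_so_far + 1
    alive = hops <= c * round_trip
    return hops, alive
```

With R = 4 and c = 10, the agent survives its 40th hop and kills itself on the 41st.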



Corruption of Agent Variables. An unreliable agent with corrupted variables
can arbitrarily alter the global state of the system. Note that the corruption
of the non-privileged variables is not a matter of concern, because these are
expected to assume arbitrary values when the agent interacts with the underlying
system. Our only concern is the possible corruption of privileged variables.

^5 When the agent is internally installed, it is possible to avoid a system-level timeout
by using a modified version of the stabilizing mutual exclusion protocol presented
by Dijkstra [J]. In his system, the number of agents is always positive by definition.

To recover from such failures, we need to demonstrate that despite the cor- 
ruption of the privileged variables, eventually the agent reaches the global state 
to which it was initialized by its home process. For each agent-stabilizing system, 
as a part of the correctness proof, we need to prove the following theorem: 



Theorem 1. An unreliable agent is eventually substituted by a reliable agent. 

We demonstrate such a proof for the maxima finding protocol described in 
Fig. 1. Assume that the value of the privileged variable MODE has been cor- 
rupted. Since the generic remedies guarantee that the agent eventually returns 
home (or else, is killed and a new agent is generated by its home process), one of 
the two actions in the program for the home process is executed. If MODE = 2, 
then the second action changes MODE to 0. Otherwise, the execution of the 
first action increments the value of MODE until it becomes 2. Thereafter, the 
execution of the second action changes MODE to 0, which is the desired initial 
state of the reliable agent. □ 
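
The case analysis in this proof can be replayed mechanically. The sketch below applies the home program of Fig. 1 to every possible value of MODE and counts the visits to the home until the briefcase is reset; the function names and the placeholder value used for a corrupted MAX are assumptions of the example.

```python
def home_program(mode, max_value):
    """Home program of Fig. 1 as a pure function on (MODE, MAX)."""
    if mode < 2:
        return mode + 1, max_value
    if mode == 2:
        return 0, 0
    raise ValueError("MODE outside {0, 1, 2}")


def visits_until_reset(mode, max_value=999):
    """Number of home visits until the briefcase is back to MODE = 0, MAX = 0."""
    visits = 0
    while True:
        mode, max_value = home_program(mode, max_value)
        visits += 1
        if (mode, max_value) == (0, 0):
            return visits
```

The worst case is a corrupted agent that arrives with MODE = 0 and a bad MAX, which needs three visits to the home before it is reset.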

The maximum time needed for the above recovery is 3·R. Taking the other
remedies into account, the time required to substitute an unreliable agent by
a reliable one is O(R), where R is the maximum time needed by the agent to
traverse the network.

For internally installed agents, all the above remedies will apply. Addition- 
ally, in some cases, processes can provide extra support in failure detection - 
for example, each process k knows the identity of the leader, leader(k), so if the
agent’s id is corrupted, then it can be detected by any non-faulty process in
the system (Sj, and the agent can be killed. This failure detection mechanism
is clearly unreliable, as it has the extra risk that a faulty process k with an
incorrect value of leader(k) may suspect a reliable agent to be faulty, and kill it
- leaving the recovery to the leader. Any fault detection must be followed by a 
fresh round of leader election. 

5 Multiple Agents 

The motivation behind the use of multiple agents is better parallelism that can 
possibly reduce the time complexity or the message complexity. The number of 
agents to be deployed to minimize these complexities depends very much on the 
nature of the problem. There are cases, where a single agent solution is the best, 
whereas there are others in which multiple agents accelerate the progress. In 
some cases, the dynamic model may perform better than the static model. We 
now present various cases illustrating these ideas. 

For notational convenience, we represent the agents by the upper case letters
I, J, K, ... Our model is a synchronous one, where in unit time, every agent takes
a step. We assume that every agent visiting a process i leaves its “footprint” by
writing its own id to a local variable f(i), which is a set of agent identifiers. Each
agent can find out which other agents visited process i by examining f(i). We
will address the issue of bad data in f(i) soon.

From time to time, an agent may meet with another agent to exchange data 
from their briefcases. On cyclic topologies, there is a risk for deadlock due to the 
possible scenario where each agent waits indefinitely for another agent to show 
up. To prevent this, we use the following rule: 



Asymmetric Waiting. An agent with a larger id can wait for an agent with 
a smaller id, but the converse is not true. 

An agent that does not wait for another agent, simply continues with its 
traversal, and postpones its meeting for a finite number of traversals. The fol- 
lowing lemma is presented without proof: 



Lemma 1. Both deadlock and livelock are impossible. 



Fossils in the Briefcase. Fossils are bad entries in one agent’s briefcase about 
other agents or processes that no longer exist. In a legal configuration, an agent 
only keeps data pertaining to other agents that it meets, or processes that it vis- 
its. The supervisory program S of every agent keeps count of the number of hops 
made by the agent without meeting other agents or without visiting processes 
that are included in its briefcase. For each such agent or process, when the count 
exceeds predefined limits (as determined by the application), the corresponding 
entries are removed from its briefcase. 
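
The ageing of briefcase entries can be sketched as one update per hop. This is an illustration only: the representation of the briefcase as a dictionary from an agent or process name to the number of hops since it was last met or visited, the single shared limit, and the function name are all assumptions. Adding entries for newly met agents is left to the application.

```python
def age_briefcase(entries, seen_now, limit):
    """Advance every entry by one hop and drop those older than `limit`.

    entries:  dict mapping an agent or process to hops since it was last seen
    seen_now: agents met or processes visited during the current hop
    """
    updated = {}
    for key, age in entries.items():
        age = 0 if key in seen_now else age + 1
        if age <= limit:               # entries past the limit are fossils
            updated[key] = age
    return updated
```

An entry that keeps being refreshed stays at age 0, while an entry for an agent that never shows up again is removed once its age exceeds the limit.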

A related issue is that of bad entries in f(i) created by agents that are now
dead, or created due to bad initialization or by transient failures. To deal with
this, whenever an agent J visiting a process i discovers another agent K’s footprint
(J ≠ K) in f(i), J checks whether K exists or not. This is a required step
regardless of whether any data exchange is involved. If agent K shows up within
a specific time period, then both of them continue with their individual protocols;
otherwise the entry in f(i) is deleted by agent J. Note that to enforce a
timeout for the removal of bad entries in f(i), active agents use the local clock
of process i - no active participation from the process i is necessary. Fossil
management is reminiscent of soft states used in [2] that need to be periodically
refreshed to attain permanence, and is an important function of the supervisory
program S. This leads to the following lemma:



Lemma 2. An agent eventually removes all entries pertaining to agents that it
does not meet, or processes that it does not visit. Also, all entries in f(i) that
correspond to agents that never visited i or do not exist now, are eventually
removed or substituted by a default value.

As a consequence of Lemma 2, we will ignore fossils in the subsequent sec- 
tions. 
5.1 Maxima Finding

We revisit the maxima finding problem on a ring that was solved earlier by a
single reliable agent in 3(n - 1) time. This time, we solve it on a static model
containing k (1 < k < n) agents: each has a home at a designated segment of
the network. For simplicity, we assume that the ring segments are of identical
size n/k (Fig. 3). Each agent computes the maxima in its local segment (using
a variation of the protocol of Fig. 1), compares the maxima when it meets a
neighboring agent, and appropriately updates the value of max for the processes
in its local region.




Fig. 3. Maxima finding with multiple agents. 

Each agent starts with zero knowledge about the value of the maxima MAX.
Thereafter, each of the k agents communicates with its neighboring agents to
compute the maximum value MAX, after which this value is written into the
processes (i.e., max(i) := MAX) belonging to their respective segments. The
proposed protocol is a modification of the protocol of Fig. 1. The variable MODE
is now used to keep track of the number of times an agent meets a neighboring
agent. Initially, MODE = 0. When MODE becomes 2, each reliable agent must
have correctly read the largest id in its segment into its MAX. In addition, each
agent obtains the largest id in its neighbors’ segments after two meetings with
each neighboring agent. There are k agents, and the farthest agent is (k - 1)/2
segments away. This leads to the following lemma:



Lemma 3. To find out the maxima, each of the k agents must communicate
with its neighboring agents at least (k - 1) times.

Once the MODE of an agent equals k, the agent has to move past its home at
least twice to write the value of MAX to all the processes in its segment. The
protocol is shown in Fig. 4.



Theorem 2. The protocol of Fig. 4 stabilizes the system to a legal configuration
in which for every process i, max(i) is equal to the largest id in the entire
system.
Program for agent I when the number of agents > 1

agent variables MODE, MAX;
process variables max;

if MAX < id ∧ MODE < k → MAX(I) := id
When agent I meets agent J
□ MAX(I) < MAX(J) ∧ MODE < k → MAX(I) := MAX(J);
                               MODE := MODE + 1
□ MODE ≥ k → max(i) := MAX;
fi;
NEXT := neighbor

Program for the home process

Executed whenever the agent reaches home
initially MAX = 0, MODE = 0

if MODE = k → MODE := MODE + 1
□ MODE = k+1 → MAX := 0; MODE := 0
fi;
NEXT := neighbor



Fig. 4. Stabilizing protocol for maxima finding using multiple agents. 



Proof. We first assume that the agents are reliable. Each agent has a variable
MODE ∈ {0..k+1}. Each agent computes its MAX as MODE increases from 0
to k - 1, and writes this value into max(i) for each process i in its local segment,
after which MAX is reset to 0. Any bad initial value of MAX is thus removed
from the system in a finite number of steps.

Consider the agent in a segment where the process with the maximum id
resides. In this segment, the value of MAX is correctly set to the largest id after
MODE increases from 0 to 2, and this value is maintained until MODE = k+1.
Since the agents are not simultaneously initialized by the home process, the value
of MODE for a neighboring segment may be arbitrary. Regardless of this, as
soon as MODE is incremented from 0 to 4 in a neighboring segment, the value
of MAX in that segment equals the largest id in the entire system.

Using inductive arguments we can show that the agent in the farthest segment
will set its MAX to the largest id in the system after its MODE reaches
2 · (k - 1)/2 = k - 1. Since for each segment the agent begins updating max
when MODE reaches k, each agent correctly updates max for all the processes
belonging to its segment.

The proof for the closure of the legal configuration is trivial. 

Finally, if the agents are unreliable, then eventually both MODE and MAX
are reinitialized to 0 when they visit their homes and MODE reaches k + 1.
Thus the agents eventually become reliable. □
The message complexity, which is determined by the maximum number of
hops taken by all the agents, is O(n · k). Paradoxically, in this case, multiple
agents increase the message complexity. The time complexity, however, remains
unchanged at O(n), as in the single-agent case. The lesson is that much of the
work done by these agents is unproductive, and this form of parallelism does not
lead to faster stabilization.

5.2 Spanning Tree Construction 

We consider a connected graph g = (u, e) where the root is the home of the 
agent. A single-agent protocol for stabilizing DFS spanning tree generation is 
presented in Section 3. We now employ multiple agents to generate a spanning
tree (not necessarily DFS) of g, in the hope of reducing the message or time
complexity.



The Static Model. Our static model uses a fixed number k of agents (1 < k <
n). The proposed protocol is an adaptation of the Chen-Yu-Huang protocol [?] for
spanning tree generation. The home of the agent with the smallest id is
designated the root of the spanning tree, and we call this agent the root agent.
The spanning tree generation has two layers. In the first layer, the agents work
independently and continue to build disjoint subtrees of the spanning tree, until
they meet other agents. In the second layer, the agents meet other agents to build
appropriate bridges among the different subtrees; this results in a single spanning
tree of the entire graph.

For the first layer, we will use the protocol of Fig. 2 with the only modification
that an agent does not distinguish between a node being visited by itself, or
by another agent. We will only elaborate on the second layer, where a pair of
agents K, L meet to make a decision about the bridge between them during an
unplanned meeting at some process x. We will designate a bridge by the briefcase
variable BB. During the meeting, one of the two agents (say K) that is yet to define
its bridge sets its briefcase variable BB to (L, x). Thereafter, node x will have
two parents p_K and p_L from the two subtrees generated by K and L (see node
j in Fig. 5). When node x does not have a parent in the subtree defined by agent
K, p_K = φ.

The maximum number of parents for any node is min(δ, k), where δ is the
degree of the node and k is the number of agents. By definition, the root agent
does not have a bridge (we use BB = ⊥, ⊥ to represent this).

In addition to BB, we add another non-negative integer variable Y (0 ≤ Y ≤
k) to the briefcase of every agent. By definition, Y = 0 for the root agent.
Furthermore, during a meeting between two agents K, L, when K sets up its bridge
to (L, x), it also sets Y(K) to Y(L) + 1. Thus, Y(K) denotes "how many subtrees
away" the subtree of K is from the root segment. In a consistent configuration,
for every agent, Y < k. Therefore, if Y = k for any agent, then the bridge for
that subtree is invalidated.

Fig. 6 describes the protocol for building a bridge between two subtrees. The
home processes initialize each BB to ⊥, ⊥ once, but like other variables, these
are also subject to corruption. The description of this protocol does not include
the fossil removal actions.



Fig. 5. The spanning tree viewed as a graph with the nodes as subtrees and the
edges as bridges.



Program for agent K while meeting agent L at node i

agent variables BB, NEXT, PRE;
process variables p;

initially BB = ⊥, ⊥;

do BB(K) = L,i ∧ BB(L) = K,i ∧ K < L → BB(K) := ⊥, ⊥;
□ BB(K) = ⊥, ⊥ ∧ BB(L) ≠ K,i ∧ K ≠ root agent ∧ Y(L) ≠ k →
      BB(K) := L,i; Y(K) := Y(L) + 1
□ BB(K) = L,i ∧ BB(L) ≠ K,i ∧ Y(L) ≠ k ∧ Y(K) ≠ Y(L) + 1 →
      Y(K) := Y(L) + 1
□ BB(K) ≠ L,i ∧ BB(L) = K,i ∧ p_K(i) ≠ PRE → p_K(i) := PRE
□ BB(K) = L,i ∧ Y(L) = k ∧ Y(K) ≠ k → Y(K) := k;
□ Y(K) = k ∧ BB(K) ≠ L,i ∧ Y(L) < k - 1 →
      BB(K) := L,i; Y(K) := Y(L) + 1
□ BB(L) ≠ K,i ∧ p_K(i) ≠ φ → p_K(i) := φ
od;

NEXT := PRE



Fig. 6. Program for building a bridge between adjacent subtrees. 



Theorem 3. For a given graph, if each agent independently generates disjoint 
subtrees, then the protocol in Fig. 6 stabilizes to a spanning tree that consists 
of all the tree edges of the individual subtrees. 
Proof Outline. As a consequence of the fossil removal mechanism, for every
agent K, eventually BB = ⊥, ⊥ or L,i, where i is a process visited by both L
and K. By definition, each subtree has exactly one bridge BB linking it with
another subtree. Draw a graph g', in which the nodes are the subtrees (excluding
the bridges) of g, and the edges are the bridges linking these subtrees. Using the
arguments in [?], we can show that g' will eventually be connected and acyclic.
Therefore the set of edges (connecting a node with its parents) generated by the
protocol of Fig. 6 defines a spanning tree. □

Note that any existing spanning tree configuration is closed under the actions 
of the protocol. 

To estimate the complexities, assume that each subtree is of equal size $n/k$.
Let $M(s_K)$ be the number of messages required by a single agent K to build a
subtree of size $s_K$ starting from an arbitrary initial state. From [?], $M(s_K) =
O(s_K^2)$. Also, once the subtree is stabilized, the number of messages required to
traverse the subtree is $2 \cdot (s_K - 1)$. Since we assume $s_K = n/k$, the number of
messages needed to build the k subtrees is $k \cdot M(n/k)$. To estimate the number of
messages needed to detect a cycle in the graph g' using the condition $Y \ge k$,
consider a cycle $s_0 s_1 \cdots s_t s_0$ ($t \le k$) in g', where each node is a subtree. To
correctly compute Y, each agent has to read the value of Y from the agent in its
predecessor segment. This can take up to $1 + 2 + 3 + \cdots + (t - 1) = t(t-1)/2$
traversals of the subtrees. Since the maximum value of t is k, for correctly
detecting cycles in g', at most $k(k-1)/2$ traversals will be required. Also, each
time a cycle is broken, the number of disjoint subtrees in g' is reduced by one
(see [?]), so this step can be repeated no more than $(k - 1)$ times. Therefore the
maximum number of messages needed for the construction of a spanning tree using
a set of cooperating reliable agents will not exceed

$$k \cdot O\left(\frac{n^2}{k^2}\right) + (k-1) \cdot \frac{k(k-1)}{2} \cdot 2\left(\frac{n}{k} - 1\right) = O\left(\frac{n^2}{k} + n \cdot k^2\right)$$

To estimate the worst-case message complexity, we also need to take into account
the overhead of fossil removal. This is determined by the number of hops taken
by the agents to traverse the subtrees of size $n/k$, which is $O((n/k) \cdot k) = O(n)$. Note
that this does not increase the order of the message complexity any further. The
interesting result, at least with this particular protocol, is that as the number
of agents increases, the message complexity first decreases, and then increases.
The minimum message complexity is $O(n^{5/3})$ when $k = O(n^{1/3})$.

To estimate the time complexity, assume that each of the k agents
simultaneously builds subtrees of size $n/k$ in time $O(n^2/k^2)$. The time required by the k
agents to correctly establish their Y values is $k \cdot O(n/k) = O(n)$. At this time,
the condition $Y \ge k$ can be correctly detected. The resulting actions reduce the
number of disjoint subtrees by 1, so these actions can be repeated at most $(k - 1)$
times. The time complexity is thus $O(n^2/k^2 + n \cdot k)$. The overhead of fossil removal
(which is $O(n/k)$) does not increase the time complexity any further. Therefore,
the smallest value of the time complexity is $O(n^{4/3})$ when $k = O(n^{1/3})$.
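
The trade-off in the number of agents can be checked numerically. The Python
sketch below evaluates the two bounds n²/k + n·k² (messages) and n²/k² + n·k
(time) with all hidden constants assumed to be 1, which is an assumption made
only for this illustration; the function names are hypothetical.

```python
def message_cost(n, k):
    """Message bound n^2/k + n*k^2, with constants assumed to be 1."""
    return n**2 / k + n * k**2


def time_cost(n, k):
    """Time bound n^2/k^2 + n*k, with constants assumed to be 1."""
    return n**2 / k**2 + n * k


def best_k(cost, n):
    """Number of agents k (1 < k < n) that minimizes the given bound."""
    return min(range(2, n), key=lambda k: cost(n, k))
```

For n = 1000, where the cube root of n is 10, both minimizers are within a small
constant factor of 10, in line with k = O(n^{1/3}); the exact integer depends
on the constants that the O-notation hides.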

The Dynamic Model. Finally, we examine the dynamic model of multiple
agents, where an agent can create multiple agents as and when necessary,
delegate subtasks to these agents, and kill them when they complete their subtasks.
There is a strong similarity between this approach and that used in many wave
algorithms [?]. The key ideas are:

1. When an agent reaches a "fork" (defined as a node of degree δ, δ > 2) it
creates δ - 1 child agents, one for each remaining edge, for traversing the
rest of the graph.

2. When an agent visits a node that (i) has a degree δ = 1, or (ii) has already
been visited by another agent, it retreats.

3. When the children of an agent return to their parent, the parent retrieves
the required data from the child agents, and then kills them.
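
The first two rules can be sketched as a sequential simulation. The Python
code below is an illustration under simplifying assumptions, not the protocol
of this section: agents are assumed reliable, a FIFO queue stands in for agents
that advance in parallel, an agent passing through a node of degree 2 is
modeled as a single child, and rule 3 (returning to the parent and being killed)
is not modeled. The names `spawn_spanning_tree`, `adj`, and `home` are
hypothetical.

```python
from collections import deque


def spawn_spanning_tree(adj, home):
    """Return a parent map built by agents that fork at every visited node.

    adj maps each node to a list of its neighbors (undirected graph).
    """
    parent = {home: None}
    # Each pending agent is a pair (node it was sent from, node it is heading to).
    agents = deque((home, nbr) for nbr in adj[home])
    while agents:
        src, node = agents.popleft()
        if node in parent:
            continue  # rule 2(ii): node already visited, the agent retreats
        parent[node] = src  # first agent to arrive claims the node
        for nbr in adj[node]:
            if nbr != src:
                agents.append((node, nbr))  # rule 1: one agent per remaining edge
        # A node of degree 1 has no remaining edge: rule 2(i), the agent retreats.
    return parent
```

On a connected graph the returned map has one entry per node and defines a
spanning tree rooted at the home.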

Agents spawn children with a limited lifespan and a limited agenda. When
an agent spawns child agents at a node i, that node plays the role of the home
for its children. The parent installs its own programs A and S into the child
agents. The children of an agent I bear the identification tag I:J, and the
parent agent assigns a distinct value of J to each child. Each child follows the
single-agent protocol until it returns to its parent agent.

For each process i, we recognize at most one agent K that reaches that node
first. Accordingly, f(i) is set to (K, SEQ), where SEQ is a boolean flag
representing the sequence number of the most recent visit by K. When a process has
not been visited by any agent, f(i) = ⊥, ⊥. For fossil management, any agent
K, after reaching a node i that claims to have been visited by another agent L
(L ≠ K), waits for L to show up. If indeed L shows up, then both K and L
continue with their protocols; otherwise agent K resets f(i) to ⊥, ⊥. The program
for each agent is shown in Fig. 7, which assumes all agents to be valid and reliable.



Program for agent K while visiting node i

agent variables NEXT, PRE, SEQ;
process variables f, child, p;

if (f(i) = L, - ∨ f(i) = K, SEQ) ∧ PRE ∉ child(i) → NEXT := PRE fi;
if f(i) = K, 1 - SEQ →
    Mark the current node as visited and set the parent
    f(i) := K, SEQ; p(i) := PRE;
    if δ > 2 → ∀ j ∈ neighbor(i) \ p(i): create a child agent with NEXT := j
    □ all child agents are back → kill the child agents; NEXT := p(i)
    fi;
fi



Fig. 7. The spanning tree protocol in the dynamic model of multiple agents. 



We now examine the additional types of agent failures in the dynamic model
of multiple agents. Each child agent eventually returns to its parent, whose
identity can be derived by stripping the last component of its own id. The death of
the parent agent converts its child agents into orphans. Using the supervisory
program S, each child agent increments a counter with every hop that it takes,
until it meets its parent agent. As a result, an orphan unable to locate its parent
eventually finds the value of this counter exceeding the maximum possible size
of the system. At this point, the orphan kills itself.

Footnote. There is one limitation: we expect that at least S will be correctly
installed.

Due to the overhead of fossil management, the message complexity of the 
above protocol is not better than that of the single-agent protocol. However, 
the time complexity is 0{h?) where h the height of the spanning tree with the 
home as the root node. This result is not surprising, but is encouraging for dense 
graphs. 



6 Concluding Remarks 

This paper demonstrates that agents can be a viable tool for implementing sta- 
bilizing distributed systems. It also demonstrates the power of a single focus of 
control. 

The agent model can tolerate the corruption of the agent program A, if 
the home process correctly installs this every time the agent visits home. Note 
that the supervisory program S must be incorruptible, since it has the crucial 
responsibility of failure detection. 

It is possible to solve a stabilization problem by first devising an agent-based 
protocol, and then implementing the agent(s) using the method outlined in Sec- 
tion 4. The complexity of the overall solution is determined by the sum of (i) the 
complexity of the agent-based protocol, and (ii) the complexity of stabilizing the 
agents which themselves could possibly be unreliable and (iii) the complexity of 
a stabilizing protocol for leader election. We claim that in many cases, the overall 
complexity is comparable to that of a traditional message-based solution. When 
the agent fails less frequently than the underlying distributed system, there is a 
potential to amortize the overhead due to the second component, which further 
reduces the stabilization time. 

The other noteworthy observation is the issue of productive and unproductive
parallelism. The observation that the time complexity of single-agent protocols
is sometimes comparable to that achieved by distributed demons or alternators
[?] reveals that uncontrolled parallelism does not necessarily lead to faster
stabilization. Multiple agents, installed at appropriate processes or spawned at
appropriate times, have the potential of achieving productive parallelism, and
therefore faster stabilization. The multi-agent protocols are examples of
divide-and-conquer strategies in stabilization that need further exploration.

The biggest advantage of agent stabilization seems to be the ease of deriving
single-agent protocols from known sequential algorithms for graphs, and of
converting those into multi-agent protocols using known techniques for
parallelization. On the negative side, agent-based protocols are not silent. The
time delay between the occurrence of a failure and the next visit of the agent to
address that failure may be a matter of concern if there are too few agents, or if
the network is very large.
Abstract. We study a special type of composition of self-stabilizing
algorithms: the cross-over composition (A ∘ B). The cross-over composition
is the generalization of the algorithm compiler idea introduced in [?].
The cross-over composition could be seen as a black box with two entries
and one exit. The goal of the composition is to improve the qualities of the
first algorithm A, using as medium the second algorithm B. Informally, the
obtained algorithm is A after the transfer of B's properties.

Here, we provide a complete analysis of the composition when the
algorithms (A and B) are deterministic and/or probabilistic algorithms.
Moreover, we show that the cross-over composition is a powerful tool for
forcing a scheduler to behave fairly with respect to A.



1 Introduction 

The idea of composing self-stabilizing algorithms in order to improve their
adaptability was introduced by Gouda and Herman in [?]. In their approach an
algorithm is composed of k layers such that the layer i, 1 ≤ i ≤ k,
depends on the variables which stabilize due to the actions of the layers from 1 to
i - 1. The proof of convergence of the composed algorithm follows by induction.

In the same paper, the authors present another type of composition which uses
a selection predicate. The two modules which enter in the composition do not
communicate with each other, but they are allowed to modify the same output
variables. At a given time, the selection predicate is true for only one module
(the module that is allowed to modify the output variables) while the other
module waits for the predicate value to flip.

Another type of independent module composition was defined by Varghese in
[?]. The entities interact by means of their outputs. The obtained algorithm is
the composition of the modules.

A special form of composition was defined by Dolev and Herman in [?]. The goal
of this composition is to accelerate the self-stabilization of an algorithm P. P
will pick up the result of the fastest self-stabilizing algorithm of (S_i), i ∈ I, in
order to perform its own task. This technique needs some fair scheduler.

The goal of the cross-over composition A ∘ B is to improve the qualities of the
algorithm A, using as medium the second algorithm B. Informally, the obtained
algorithm is the algorithm A after the transfer of B's computation properties.
The main use of the cross-over composition is the transformation of an algorithm
A that is self-stabilizing under a weak scheduler (central, alternating, k-fair,
fair, ...) into an algorithm (A ∘ B) which maintains the self-stabilization
property under any unfair scheduler.

Moreover, we guarantee that the composed algorithm will satisfy the conjunction
of the properties of the algorithms, as in the Varghese composition. We show
that all liveness and safety properties of A and B are also conveyed by A ∘ B iff
B is fair, when A and B are deterministic and/or probabilistic algorithms.

In Sec. 2 the model and the self-stabilization definitions for deterministic and
probabilistic algorithms are given. The cross-over definition is presented in Sec. 3.
We then study the propagation of the self-stabilization property. Finally, we explain
how to use the cross-over composition to transform any algorithm that is
self-stabilizing under some specific scheduler into an algorithm that converges
under any unfair scheduler.

2 Model 

Distributed Systems. A distributed system can be modeled by a transition
system. A transition system is a three-tuple S = (C, T, I) where C is the collection
of all configurations, I is a subset of C called the set of initial configurations,
and T is a function from C to the set of C subsets. A C subset of T(c) is called a
c transition. An element of a c transition t is called an output of t. In a
probabilistic system, there is a probabilistic law defined on the output of a transition;
in a deterministic system, each transition has only one output. In Fig. [?] we can
see the C00 transition called ch00 that has four outputs: C11, C12, C21 and
C22.

The abstract model defined above is a mathematical representation of reality.
In fact, the distributed system is a collection of processors that communicate
only with their neighboring processors to execute a distributed algorithm.

A computation of a distributed system DS is a sequence of computation steps.
A maximal computation is a sequence that is either infinite or ends in a
deadlock terminal configuration. The set of computations of a distributed system
DS is denoted by E_DS. A maximal computation e is fair if and only if every
processor performs an action infinitely often. A fair computation e is k-fair if and
only if between two actions of a processor, any other processor performs at most
k actions. A maximal computation e is k-bounded if and only if along e, while a
processor p is enabled to perform an action, another processor can perform at
most k actions. A k-fair computation is k-bounded, but the converse is not true.
When the distributed algorithm prevents fairness because some processors
are no longer enabled, the k-bounded property guarantees fairness among the
"enabled" processors. On a network of 4 processors (p1, p2, p3, p4), the following
computation is not 1-fair: (p1's action, p3's action)*; but it is 1-bounded if along
this computation p2 and p4 are never enabled to perform any action.
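
The two definitions can be tested on finite prefixes of a computation. The
Python sketch below is an approximation under stated assumptions: one processor
acts per step, the stretch before a processor's first action is counted as well
(slightly stricter than "between two actions"), and "k-bounded" is read as "while
p stays enabled without acting, no other processor acts more than k times". The
function names and the step encoding are hypothetical.

```python
def is_k_fair_prefix(actors, processors, k):
    """actors: sequence of processor ids, one acting processor per step."""
    # waiting[p][q] = actions of q since the last action of p
    waiting = {p: {q: 0 for q in processors if q != p} for p in processors}
    for actor in actors:
        for p in processors:
            if p == actor:
                waiting[p] = dict.fromkeys(waiting[p], 0)
            else:
                waiting[p][actor] += 1
                if waiting[p][actor] > k:
                    return False
    return True


def is_k_bounded_prefix(steps, processors, k):
    """steps: sequence of (actor, set of processors enabled before the step)."""
    # waiting[p][q] = actions of q while p has been continuously enabled and idle
    waiting = {p: {q: 0 for q in processors if q != p} for p in processors}
    for actor, enabled in steps:
        for p in processors:
            if p == actor or p not in enabled:
                waiting[p] = dict.fromkeys(waiting[p], 0)
            else:
                waiting[p][actor] += 1
                if waiting[p][actor] > k:
                    return False
    return True
```

Applied to the four-processor example, a prefix of (p1's action, p3's action)*
fails the first check because p2 and p4 never act, and passes the second when
only p1 and p3 are ever enabled.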
Scheduler. In this model, a scheduler is a predicate over the system
computations. In a computation, a transition (c_i, c_{i+1}) occurs due to the execution of a
nonempty subset of the enabled processors in the configuration c_i. In every
computation step, this subset is chosen by the scheduler. At a computation step, a
central scheduler chooses an enabled processor to execute its action; a distributed
unfair scheduler chooses any nonempty subset of the enabled processors at each
computation step. A k-bounded scheduler produces only k-bounded
computations: it ensures k-fairness between processors that are enabled to perform
an action. An alternating scheduler produces only alternating computations:
between two actions of a processor p, each neighbor of p performs one and only one
action.

An algorithm under a scheduler D is fair (resp. k-fair) if any computation of
the algorithm under D is fair (resp. k-fair). When the property of fairness (resp.
k-fairness) is verified by an algorithm under any scheduler, then the algorithm is
simply called fair (resp. k-fair).

Building on previous work on probabilistic automata (see [?]), [?] presented
a framework for proving self-stabilization of probabilistic distributed systems
based on the notion of strategy. A strategy is the set of computations that can
be obtained under a specific scheduler choice. At the initial configuration, the
scheduler "chooses" one set of enabled processors (it chooses a transition). For
each output of the selected transition, the scheduler chooses a second transition,
and so on. The formal strategy definition is based on the tree of computations.
Let c be a configuration. A TS-tree rooted in c, Tree(c), is the tree representation
of all computations beginning in c. Let nd be a node in Tree(c) (i.e., a
configuration); a branch rooted in nd is the set of all Tree(c) computations starting
in nd with a computation step of the same nd transition. The degree of nd is
the number of branches rooted in nd. A sub-TS-tree of degree 1 rooted in c is a
restriction of Tree(c) such that the degree of any node (configuration) of Tree(c)
is at most 1. Figure 2 contains a strategy rooted in C00. A strategy may have a
non-countable number of infinite computations. A strategy is defined as follows:

Definition 1 (Strategy). Let DS be a distributed system, let D be a scheduler
and let c be a configuration. We call a strategy of DS under D rooted in c a
sub-TS-tree of degree 1 of Tree(c) such that any computation of the sub-tree
satisfies the scheduler D.

Let st be a strategy of the distributed system DS. An st-cone C_h is the set of all
possible st-computations with the same prefix h (for more details see [?]). The
last configuration of h is denoted last(h).

We have equipped a strategy with a probabilistic space (see [?] for more
details). The measure of an st-cone C_h is the measure of h (i.e., the product of the
probability of every computation step occurring in h). An st-cone C_h' is called
a sub-cone of C_h if and only if h is a prefix of h'. Let st be the strategy of Fig.
El let ft be the prefix (COO, cftOO, C12)(C12, cftl2, C56); in st, the probability of 
Ch is p\ ■ (1 -PeY- 
Deterministic Self-Stabilization. In order to define self-stabilization for a
distributed system, we use two types of predicates: the legitimate predicate (defined
on the system configurations and denoted by L) and the problem specification
(defined on the system computations and denoted by SP). To prove
self-stabilization to SP, one has to prove that all computations reach a legitimate
configuration and that from a legitimate configuration any computation satisfies
the predicate SP. For instance, the leadership problem specification is "there is
one leader in the network, called p, and p stays the only leader forever". In this
case, a legitimate configuration would be a configuration where there is one and
only one leader. The correctness proof consists in ensuring that the system does
not diverge from a legitimate configuration: once p is the only leader, no other
processor becomes leader and p keeps its leadership.

Let X be a set and Pred be a predicate defined on X. The notation x ⊢ Pred
means that the element x of X satisfies the predicate Pred.

Definition 2 (Deterministic Self-Stabilization). Let DS be a distributed
system. DS is self-stabilizing for a specification SP if and only if the following
two properties hold:

• convergence: all computations of DS reach a configuration that
satisfies the legitimate predicate denoted L. Formally, ∀e ∈ E_DS, e =
((c_0, c_1)(c_1, c_2) ...) : ∃n ≥ 1, c_n ⊢ L;

• correctness: all computations starting in configurations satisfying
the legitimate predicate satisfy the problem specification SP. Formally,
∀e ∈ E_DS, e = ((c_0, c_1)(c_1, c_2) ...) : c_0 ⊢ L ⇒ e ⊢ SP.

Probabilistic Self-Stabilization. Let DS be a distributed system. A predicate P is
closed for the computations of DS if and only if when P holds in a configuration
c, P also holds in any configuration reachable from c.

Notation 1. Let DS be a distributed system, D be a scheduler and st be a
strategy of DS under D. Let CP be the set of all system configurations satisfying a
closed predicate P (formally, ∀c ∈ CP, c ⊢ P). The set of st-computations that
reach configurations of CP is denoted by E_P and its probability by Pr_st(E_P).



Definition 3 (Probabilistic Stabilization). A distributed system DS is
self-stabilizing under a scheduler D for a specification SP if and only if there exists
a closed legitimate predicate L defined on configurations such that in any strategy
st of DS under D, the following conditions hold:

• convergence: the probability of the set of st-computations that reach
a configuration satisfying L is 1. Formally, ∀st, Pr_st(E_L) = 1.

• correctness: any computation starting in a configuration satisfying
L satisfies the specification SP.




Cross-Over Composition - Enforcement of Fairness under Unfair Adversary 






3 Cross-Over Composition 

3.1 Definitions 

In the sequel, we define the cross-over composition A ∘ B. The cross-over
composition could be seen as a black box with two independent entries (two algorithms
that do not share any variable) and one exit. The goal of the composition is to
improve the qualities of the first algorithm A, using as medium the second algorithm
B. The two algorithms which enter in the composition play different roles. A,
referred to in the following as the weak algorithm, is the target of the
transformation. B, referred to as the strong algorithm, is the transformation medium
which transfers its properties to the weak algorithm.

The actions of A are synchronized with the actions of B: when an A action
is performed, a B action is performed too. Thus, the computations of the
composite algorithm under any scheduler have the same properties as the
computations of B in terms of fairness.

• when a processor p performs an action of A it performs simultaneously 
an action of B (both action guards were satisfied on p); 

• a processor p may perform an action of B without performing an action 
of A (in this case all action guards of A are disabled on p) . 

The strong algorithm B acts as a computation filter for the weak algorithm A:
A will only deal with computations that can be obtained by B under the current
scheduler D. The obtained algorithm A ∘ B has the properties of B under D
and the properties of A under a scheduler that produces "B's computations".

Definition 4.

Let A be an algorithm with n actions as follows:

∀i ∈ {1, ..., n}: <guard a_i> → <action a_i>

Let B be an algorithm with m actions as follows:

∀j ∈ {1, ..., m}: <guard b_j> → <action b_j>

Assume that A and B do not share any variable. The cross-over composition
A ∘ B is the algorithm with the following m·(n + 1) actions:

∀i ∈ {1, ..., n}, ∀j ∈ {1, ..., m}:

<guard a_i> ∧ <guard b_j> → <action a_i>; <action b_j>

∀j ∈ {1, ..., m}:

¬<guard a_1> ∧ ... ∧ ¬<guard a_n> ∧ <guard b_j> → <action b_j>
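
Definition 4 translates almost directly into code. The Python sketch below is
an illustration only: an action is assumed to be a pair (guard, effect) of
functions over one shared state dictionary, with A and B touching disjoint keys;
`cross_over` and the parameter names are hypothetical.

```python
def cross_over(actions_a, actions_b):
    """Build the m*(n+1) actions of the cross-over composition of A and B.

    Each action is a pair (guard, effect): guard(state) returns a bool and
    effect(state) updates the state in place.
    """
    composed = []
    # n*m joint actions: an action of A always fires together with an action of B.
    for guard_a, effect_a in actions_a:
        for guard_b, effect_b in actions_b:
            def guard(s, ga=guard_a, gb=guard_b):
                return ga(s) and gb(s)

            def effect(s, ea=effect_a, eb=effect_b):
                ea(s)
                eb(s)

            composed.append((guard, effect))
    # m actions of B alone: allowed only when every guard of A is disabled.
    for guard_b, effect_b in actions_b:
        def guard(s, gb=guard_b):
            return gb(s) and not any(ga(s) for ga, _ in actions_a)

        composed.append((guard, effect_b))
    return composed
```

Because no composed action contains an action of A without an action of B, every
step of the composite algorithm is also a step of B, which is what lets B act as
a filter on the computations of A.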



Example 1. Let r be a unidirectional ring of n processors. Algorithm B has one
integer variable v and one action: on any processor p (lp being the left neighbor
of p), v_p ≤ v_lp → v_p := v_lp + 1. Any computation of B has a suffix that is
alternating. Algorithm A has two variables CurrentList and BackupList (lists of
processor ids) and one action (without guard): p copies the content of the
CurrentList of its left neighbor into its own CurrentList; then it concatenates
its own id at the end of the list. If p's id appears two times in this list, p copies
the segment between its ids into the BackupList. If p's id appears three times in
p's CurrentList, then p empties its CurrentList. The cross-over composition
A ∘ B has three variables: one integer and two lists of processor ids; and the
following action:

b1. v_p ≤ v_lp → v_p := v_lp + 1; p copies the content of the CurrentList of its
left neighbor into its own CurrentList; then ....

As the computations of A ∘ B have an alternating suffix, one may prove that
the algorithm stabilizes: every BackupList will contain the ordered list of
processor ids on the ring.
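
The behavior of algorithm B alone can be observed with a short simulation. The
Python sketch below assumes the guard is v_p ≤ v_lp (non-strict), that the left
neighbor of processor p is processor p - 1 modulo n, and that a central
scheduler picks one enabled processor at random; the function names are
hypothetical and the random scheduler is only one of many possible schedulers.

```python
import random


def enabled(v, p):
    """Guard of B on processor p; v[p - 1] is the left neighbor (index -1 wraps)."""
    return v[p] <= v[p - 1]


def run_b(v, steps, rng):
    """Run B for up to `steps` steps under a random central scheduler.

    Returns the list of processors that acted, in order; v is updated in place.
    """
    trace = []
    for _ in range(steps):
        ready = [p for p in range(len(v)) if enabled(v, p)]
        if not ready:
            break  # deadlock; cannot happen on a ring with the non-strict guard
        p = rng.choice(ready)
        v[p] = v[p - 1] + 1
        trace.append(p)
    return trace
```

After acting, a processor holds a value strictly larger than its left
neighbor's, so it stays disabled until that neighbor acts; this is the
mechanism that prevents any processor from running ahead of the others.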

A probabilistic self-stabilizing leader election algorithm on anonymous rings is
presented in [?]. This algorithm is the cross-over composition of three algorithms,
(L ∘ RTC) ∘ DTC: DTC ensures that the computations are k-fair, RTC provides
a token circulation on the ring if the computations are k-fair, and L manages
the leadership under the assumption that a token circulates in the ring.



Observation 1. One may notice that the algorithm A2 ◇ (A1 ◇ B) is not the
algorithm (A2 ◇ A1) ◇ B. To prove this, one may study the cross-over composition
of three simple algorithms (i.e. each algorithm has one action).



3.2 Deterministic Properties Propagation 

In the following, we study how the properties of deterministic algorithms
propagate to the algorithm obtained by cross-over composition.

Lemma 1 (Propagation of Properties on Computations). Let A ◇ B be
the cross-over composition between the algorithms A and B. Let P be a predicate
on B's computations. If any maximal computation of B under the scheduler D
satisfies the predicate P then any maximal computation of A ◇ B under D
satisfies P.

Proof. Assume that there is at least one maximal computation e of A ◇ B under
D which does not satisfy the predicate P. The projection of e on B is unique
and maximal. Let e_B be this projection. Since e does not satisfy the predicate
P, e_B does not satisfy it either, which contradicts the hypothesis of the lemma.



Corollary 1 (Propagation of Fairness). Let A ◇ B be the cross-over
composition between the algorithms A and B. If Algorithm B is fair under D then
A ◇ B is a fair algorithm under D.

Corollary 2 (Propagation of k-Fairness). Let A ◇ B be the cross-over
composition between the algorithms A and B. If B is k-fair under a scheduler D
then A ◇ B is k-fair under D.

The proof of the following lemma is similar to the proof of Lemma 1.

Lemma 2 (Propagation of Convergence Properties). Let A ◇ B be the
cross-over composition between the algorithms A and B. Let P be a predicate
on B's configurations. If any maximal computation of B under the scheduler
D reaches a configuration which satisfies the predicate P then any maximal
computation of A ◇ B under D reaches a configuration which satisfies P.

Corollary 3 (Propagation of Liveness). Let A ◇ B be the cross-over
composition between the algorithms A and B. If B is without deadlock under the
scheduler D then A ◇ B is without deadlock under D.

Lemma 3 (Maximality of the Weak Projection). Let A ◇ B be the cross-over
composition between A and B. If B is fair under the scheduler D then the
projection on A of any maximal computation of A ◇ B under D is maximal.

Proof. Let e be a maximal computation of A ◇ B. Let e_A be the projection of e on
A. Assume that e_A is not maximal. Hence e_A is finite and its last configuration
is not a deadlock. Write e = e1 e2, where the projection of e1 on A is e_A and
the projection of e2 is a maximal computation which does not contain any action
of A. Let c be the last configuration of e1. The projection of this configuration
on A is not a deadlock, so there is at least one processor, p, which satisfies a
guard of A in c. Using the fairness of B and Corollary 1, we prove that p executes
an action of B in e2. According to the definition of the cross-over composition,
when p executes its first action in e2, it executes an action of A. This
contradicts the assumption that no action of A is executed in e2.

3.3 Propagation on a Simple Probabilistic Cross-Over Composition 

In this section, we study the properties of a strategy of A ◇ B when X (= A or
B) is a probabilistic algorithm and the other one is a deterministic algorithm.
Figure 1 displays an example of such a cross-over composition.

Lemma 4. Let X be a probabilistic algorithm, let Y be a deterministic one,
and let Z = X ◇ Y and W = Y ◇ X be their cross-over compositions. Let st be
a strategy of Z or W under the scheduler D. Let st_X be the projection of st on
the algorithm X. Then st_X is a strategy of X under D.

Proof Outline: Let c be the initial configuration of st, and c_X its projection
on X. Let TS(c_X) be the tree representation of X's computations beginning at
c_X. Let st' be the sub-tree of TS(c_X) that contains all computations that are
projections of computations of st. st' is a strategy: all computation steps
beginning at a node n in st' belong to the same transition, nd; and all
computation steps of the transition nd are in st'.

Theorem 1. Let X be a probabilistic algorithm, let Y be a deterministic one,
and let Z = X ◇ Y and W = Y ◇ X be their cross-over compositions. Let
st be a strategy of Z or W under the scheduler D. Let st_X be the projection
of st on X. Let PC_X be a predicate over X's configurations; then
Pr_st(ℰPC_X) = Pr_{st_X}(ℰPC_X). Let PE_X be a predicate over X's
computations; then Pr_st{e ∈ st | e ⊢ PE_X} = Pr_{st_X}{e' ∈ st_X | e' ⊢ PE_X}.

Fig. 1. The projection of a strategy of A ◇ B on A (B being a deterministic
algorithm).



Proof Outline: Let C_h be a cone of st. The projection of C_h on X is a cone of
st_X: C_h', where h' is the projection of h on X.

3.4 Propagation on a Double Probabilistic Cross-Over Composition 

In this section, we study the properties of a strategy of A ◇ B when A and B are
probabilistic algorithms. The projection of a strategy on an algorithm is not a
strategy (Fig. 3 is the projection on B of the strategy of Fig. 2). The projection
is decomposed into strategies.

Definition 5 (Derived Strategies). Let st be a strategy of A ◇ B and let
stproj_X be the projection tree obtained after the projection of the computations
in st on X (X = A or X = B). A derived strategy of stproj_X is a subtree of
stproj_X whose degree equals 1.

Observation 2. Let st be a strategy of A ◇ B. Let stproj_X be the projection of
st on X (X = B or A). For the sake of simplicity, we assume that X = B. Let
st_B be a strategy derived from stproj_B. Each cone of computations in st_B,
C_{h|B}, is the projection of a cone of st, C_h, whose probability is given by the
probability of C_{h|B} in st_B multiplied by a weight δ. The weight of the cone of
history h|B in st_B (denoted by δ) is the probability of the A computation steps
executed in the history of C_h. Hence, Pr_st(C_h) = δ · Pr_{st_B}(C_{h|B}).

Fig. 2. The beginning of the strategy st of A ◇ B.



Definition 6 (Projection Strategies on X). We call a projection strategy
st_X a derived strategy of stproj_X such that all cones of st_X having the same
length have the same weight. We denote by δ_n^{st_X} the weight of the n-length
cones of st_X.

For instance, Fig. 4 and Fig. 5 contain 4 projection strategies on B of the strategy
st (Fig. 2). Each strategy has a different δ_2 value.

Observation 3. Let N be an integer. Let st be a strategy of A ◇ B. Let stproj_X
be the projection of st on X (X = B or A). For the sake of simplicity, we assume
that X = B. Let M_B be the set of projection strategies of stproj_B. There is a
finite subset of M_B, denoted M_B^N, such that Σ_{st_i ∈ M_B^N} δ_N^{st_i} = 1.
Such a subset of M_B is called an N-length B picture of st.

Each N-length cone of st has one and only one projection on B in M_B^N.

There are several subsets of M_B that are "N-length B pictures of st".

The 4 strategies of Fig. 4 and Fig. 5 constitute an M_B^2 set. We have
Σ δ_2^{st_i} = 1.

We show that if, in any strategy of an algorithm, the set of computations
reaching a configuration satisfying a predicate P has probability 1, then in
any strategy of the composed algorithm this set has probability 1.

Lemma 5 (Probabilistic Propagation). Let ℒ be a predicate over X's
states (X = A or X = B). Let st be a strategy of A ◇ B under a scheduler
D. If for every strategy st_X, being a projection strategy of st on X, we have
Pr_{st_X}(ℰℒ) = 1 then Pr_st(ℰℒ) = 1.

Fig. 3. The beginning of the projection of the strategy st on B.

Proof. Let st be a strategy of A ◇ B. Let stproj_X be the projection of st on X
(X = B or A). We assume that X = B. Let M_B be the set of projection strategies
of stproj_B.

We denote by ℰℒ_N the union of the cones of a given strategy that have the
following properties: (i) their history length is N and (ii) they have reached a
legitimate configuration.

Let ε be a real number smaller than 1. By hypothesis, there is an integer N such
that on any strategy st_B of M_B^N (an N-length B picture of st) we have
Pr_{st_B}(ℰℒ_N) > 1 − ε.

Pr_st(ℰℒ_N) = Σ_{st_i ∈ M_B^N} (Pr_{st_i}(ℰℒ_N) · δ_N^{st_i})
            > (1 − ε) · Σ_{st_i ∈ M_B^N} δ_N^{st_i} = 1 − ε.

In st, the set of computations reaching legitimate configurations in N' ≤ N steps
has a probability greater than 1 − ε.

Therefore, for any sequence ε_1 > ε_2 > ε_3 ... there is a sequence
N_1 < N_2 < N_3 ... such that Pr_st(ℰℒ_{N_i}) > 1 − ε_i. Then,
lim_{n→∞} Pr_st(ℰℒ_n) = Pr_st(computations reaching a legitimate
configuration) = 1.

Fig. 4. 2-length beginning of projection strategies on B.



4 Cross-Over Composition and Self-Stabilization 

In the sequel, we study the propagation of the self-stabilization property from
an algorithm to the algorithm resulting from a cross-over composition. In the
deterministic case, the propagation of self-stabilization is a direct consequence
of Lemmas 1 and 2 when the strong algorithm (B) is the propagation initiator.

Lemma 6. [Self-Stabilization Propagation to the Deterministic
Algorithm A ◇ B from Algorithm B] Let A ◇ B be the cross-over composition
between the deterministic algorithms A and B. If Algorithm B self-stabilizes for
the specification SP under the scheduler D then A ◇ B is self-stabilizing for SP
under D.

Proof. The proof is a direct consequence of Lemmas 1 and 2. In order to prove
the convergence, we apply Lemma 2 to the property which characterizes the
legitimate configurations. The correctness proof results from Lemma 1 applied to
the specification SP.

In order to ensure the liveness of the weak algorithm, the strong algorithm must
be fair.

Fig. 5. 2-length beginning of projection strategies on B.



Lemma 7. [Self-Stabilization Propagation to the Deterministic
Algorithm A ◇ B from Algorithm A] Let A ◇ B be the cross-over composition
between the deterministic algorithms A and B (B is a fair algorithm). If
A is self-stabilizing for the specification SP under the scheduler D then A ◇ B
stabilizes for the specification SP under D.

Proof. Let e be a maximal computation of A ◇ B.

• Convergence of Algorithm A ◇ B. Let P be the predicate which
characterizes the legitimate configurations of A and let e_A be the projection
of e on A. According to Lemma 3, e_A is a maximal computation of A,
and e_A reaches a legitimate configuration (A is self-stabilizing).

• Correctness of Algorithm A ◇ B. Let e be a computation of A ◇ B which
starts in a configuration satisfying P. Let e_A be its projection on A. The
computation e_A is maximal and starts in a configuration which satisfies
P. Since A is self-stabilizing, e_A satisfies the specification SP; hence e
also satisfies the specification SP.

Lemma 8. [Self-Stabilization Propagation to the Probabilistic
Algorithm A ◇ B from Algorithm B] Let A ◇ B be the probabilistic cross-over
composition between the algorithms A and B. If the algorithm B self-stabilizes
for the specification SP under the scheduler D then A ◇ B is a probabilistic
self-stabilizing algorithm for SP under D.

Proof. Let us study the propagation in the two possible cases: B is a
deterministic algorithm or a probabilistic one. The idea of the proof is to
analyze an arbitrary strategy st of A ◇ B under D.

• B is deterministic. The projection on B of every computation of the
strategy st is maximal. Every computation of st reaches a legitimate
configuration; and then, it satisfies the specification SP. A ◇ B is
self-stabilizing for SP under D.

• B is probabilistic. Let st be a strategy of A ◇ B. Let ℒ be the legitimate
predicate associated with SP. According to Lemma 5 or to Theorem 1,
Pr_st(ℰℒ) = 1. Moreover, according to Lemma 1, all computations of st
that reach ℒ have a suffix that satisfies SP.

The self-stabilization propagation from the weak algorithm is possible only if
the strong algorithm is fair.

Lemma 9. [Self-Stabilization Propagation to the Probabilistic
Algorithm A ◇ B from Algorithm A] Let A ◇ B be the probabilistic cross-over
composition between the algorithms A and B. If Algorithm A self-stabilizes for
the specification SP under the scheduler D and Algorithm B is a fair algorithm
under D then A ◇ B is a probabilistic self-stabilizing algorithm for SP under D.

Theorem 2. [Self-Stabilization Propagation from A and B] Let A ◇ B
be the probabilistic cross-over composition between the algorithms A and B. If
Algorithm A self-stabilizes for the specification SP under the scheduler D and
Algorithm B is self-stabilizing for the specification SR and is fair under D then
A ◇ B is a probabilistic self-stabilizing algorithm for SP ∧ SR under D.

Note that in both cases (deterministic and probabilistic) the strong algorithm
will propagate the self-stabilization property to the result of the composition
without any restriction, while the propagation initiated by the weak one can
be realized if and only if the strong algorithm is fair.

5 Application: Scheduler Transformation 

In this section, we present the main application of the cross-over composition,
the scheduler transformation. We show how to use the cross-over composition
to transform any algorithm that is self-stabilizing under some specific scheduler
into an algorithm that converges under any unfair scheduler.

Definition 7 (Fragment of Owner p). Let e be a computation of a distributed
system and let p be a processor such that p executes its actions more than one
time in e. A fragment of e of owner p, f_pp, is a fragment of e such that:

• f_pp starts and finishes with a configuration where p executes an action;

• along f_pp, p executes exactly two actions (during the first and the last
step of f_pp).
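
To make fragments concrete, the following Python sketch scans a schedule and reports the largest number of actions any single processor performs inside a fragment owned by another processor. It is only an illustration under assumptions that are not in the text above: the computation is abstracted as a list of processor ids with one processor acting per step, and a computation is taken to be k-fair when no processor performs more than k actions inside any fragment f_pp. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List


def max_foreign_actions(trace: List[Hashable]) -> int:
    """Largest number of actions by one processor q inside any fragment f_pp."""
    worst = 0
    last_seen: Dict[Hashable, int] = {}
    for idx, p in enumerate(trace):
        if p in last_seen:
            # Steps strictly between two consecutive actions of p: these are
            # the steps of the fragment f_pp taken by other processors.
            inside = Counter(trace[last_seen[p] + 1:idx])
            if inside:
                worst = max(worst, max(inside.values()))
        last_seen[p] = idx
    return worst


def is_k_fair(trace: List[Hashable], k: int) -> bool:
    """Assumed reading of k-fairness on a finite trace."""
    return max_foreign_actions(trace) <= k
```

For the trace [0, 1, 1, 0, 2, 1, 0], processor 1 acts twice between the first two actions of processor 0, so the trace is 2-fair but not 1-fair under this reading.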

Lemma 10 (From k-Fairness to the k-Bound Property). Let us consider
the cross-over composition A ◇ B. Let e be a computation of A ◇ B under an
arbitrary scheduler. If B is k-fair then the projection of e on A is k-bounded.

Proof. Let e_A be the projection of e on A. Suppose that e_A is not k-bounded.
Hence, there is a fragment f_A of e_A such that a processor q performs k + 1
actions during f_A and such that another processor p performs no action and is
always enabled along f_A. f_A is the projection of a fragment of e called f.
According to the definition of A ◇ B, f has the following properties: (i) p
performs no action in f; (ii) q performs at least k + 1 actions in f. f is part
of a fragment owned by p, called f_pp, such that q performs at least k + 1
actions in f_pp. Such a fragment f_pp does not exist because A ◇ B is k-fair
(Corollary 2).

The following theorem gives a tool for transforming an algorithm A that
self-stabilizes under a k-bounded scheduler into an algorithm A' that
self-stabilizes under an unfair scheduler. Let B be an algorithm whose
computations are k-fair; the transformed algorithm is A' = A ◇ B.

Theorem 3. Let A ◇ B be the cross-over composition between A and B. Assume
that A is a self-stabilizing algorithm for the specification SP under a k-bounded
scheduler and that B is a k-fair algorithm. Then the algorithm A ◇ B is
self-stabilizing for the specification SP under an unfair scheduler.

Proof. Let e be a maximal computation of A ◇ B and let e_A be its projection on
the weak module. Since B is k-fair, according to Lemma 10, e_A is a k-bounded
computation.

Correctness proof: Let ℒ be the legitimate predicate associated with SP. Once
e_A has reached a legitimate configuration (a configuration that satisfies the
predicate ℒ), it satisfies the specification SP.

Convergence proof:

• A is a deterministic algorithm: e_A reaches a legitimate configuration.

• A is a probabilistic algorithm. Let st be a strategy of A ◇ B under a
distributed unfair scheduler. Let st_A be a projection strategy of st on A.
According to Lemma 10, any execution of st_A is k-bounded. According
to the hypothesis, Pr_{st_A}(ℰℒ) = 1. According to Lemma 5 or to Theorem
1, Pr_st(ℰℒ) = 1.



Note that the transformation depends directly on the properties of the strong
algorithm of a cross-over composition. The main question is: "are there
algorithms able to satisfy the k-bound property under any unfair scheduler?"
The answer is positive, and in the following we show some examples:
Protocol   Topology                     Network type   Scheduler transf.
H          general networks, bidir.     with id        central to unfair
P          general networks, bidir.     with id        central to unfair
P          general networks, bidir.     with id        λ1-bounded to unfair
m          rings, unidir.               anonymous      λ1-bounded to unfair
p          rings, unidir.               anonymous      alternating to central
p          general networks, unidir.    anonymous      λ2-bounded to unfair

λ1 = n − 1 and λ2 = n·MaxOut^Diam, where MaxOut is the maximal network
out-degree and Diam is the network diameter.
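
As a small worked illustration of these bounds, the following Python sketch computes both quantities for a network given as adjacency lists. It assumes that the two bounds read n − 1 and n·MaxOut^Diam, that the network is strongly connected, and that the diameter is measured along directed edges; the graph representation and all function names are hypothetical.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[int, List[int]]  # node -> list of out-neighbors


def max_out_degree(g: Graph) -> int:
    return max(len(out) for out in g.values())


def diameter(g: Graph) -> int:
    """Longest shortest directed path, found by a BFS from every node."""
    worst = 0
    for src in g:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in g[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(g):
            raise ValueError("graph is not strongly connected")
        worst = max(worst, max(dist.values()))
    return worst


def bound_1(g: Graph) -> int:
    """Assumed first bound: n - 1."""
    return len(g) - 1


def bound_2(g: Graph) -> int:
    """Assumed second bound: n * MaxOut ** Diam."""
    return len(g) * max_out_degree(g) ** diameter(g)
```

On a unidirectional ring of 4 processors (MaxOut = 1, Diam = 3) the two bounds evaluate to 3 and 4; on a bidirectional triangle (MaxOut = 2, Diam = 1) the second bound evaluates to 6.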

The protocols 0 and jZj work in id-based networks. In the case of
anonymous networks, an algorithm which ensures the transformation of a central
scheduler to a distributed scheduler could be the algorithm of P| executed on
top of an algorithm which ensures a unique local naming (neighbor processors
do not have the same id, but distant processors may have the same id).

6 Conclusion 

We have presented a technique for transforming algorithms that are
self-stabilizing under weak schedulers (k-bounded, central, alternating, ...)
into algorithms which maintain the self-stabilizing property under unfair and
distributed schedulers.

The key of this transformation is the cross-over composition A ◇ B: roughly
speaking, the obtained computations are the computations of A under a scheduler
that provides B's computations. The cross-over composition is a powerful tool to
obtain only specific computations under any unfair scheduler. Indeed, if all of
B's computations have "the D properties", then A only needs to be a
self-stabilizing algorithm for the specification SP under the weak scheduler D
to ensure that A ◇ B is a self-stabilizing algorithm for the specification SP
under any unfair scheduler.
Abstract. The paper presents a technique for achieving stabilization
in distributed systems. This technique, called agent-stabilization, uses 
an external tool, the agent, that can be considered as a special message 
created by a lower layer. Basically, an agent performs a traversal of the 
network and if necessary, modifies the local states of the nodes, yielding 
stabilization. 



1 Introduction 

Fault tolerance and robustness are important properties of distributed systems. 
Frequently, the correctness of a distributed system is demonstrated by limiting 
the set of possible failures. The correctness is established by assuming a pre- 
defined initial state and considering every possible execution that involves the 
assumed set of failures. 

The requirements of some distributed systems may include complex fail- 
ures, such as a corruption of memory and communication channels. The self- 
stabilization methodology, introduced by Dijkstra |2|, deals with systems that 
can be corrupted by an unknown set of failures. The system self-stabilizing prop- 
erty is established by assuming that any initial state is possible, and considering 
only executions without transient failures. Despite a transient failure, in a finite 
time, a self-stabilizing system will recover a correct behavior. 

Self-stabilizing systems are generally hard to design, and still harder to 
prove Pj. In this paper, we are proposing to investigate a specific technique 
for dealing with the corruption problem. The basic idea appears in a paper of 
Ghosh jS| and consists of the use of mobile agents for failure tolerance. A mobile 
agent can be considered as an external tool used for making an application self- 
stabilizing. The notion that we propose here is different from Ghosh’s notion. A 
mobile agent can be considered (in an abstract way, independently of its imple- 
mentation) as a message of a special type, that is created by a lower layer (the 
outside). We will assume that some code corruption is possible and a mobile 
agent (or more simply an agent) can carry two types of data: a code, that can be 
installed at some nodes, and some information (referred to as contained in the 
briefcase), that can be read or modified by the processors. Code and briefcase
are specific to the application controlled by the agent. 

The outside can create a single agent and install it on an arbitrary node (the 
initiator). As long as the agent has not been destroyed, no other agent can be 
created. An agent has a finite lifetime. It can be destroyed either by its initiator, 
or by any site if it is supposed to have traveled too long. After an agent has been 
destroyed, the outside will eventually create a new one (with possibly another 
initiator). Once created, an agent moves from a processor to a neighbor of this 
processor, following a non-corruptible rule. More precisely, each node has per- 
manently in its ROM some code, which is the only non-corruptible information 
and does not depend on any particular application. Moreover, it is simple and 
compact. 

When an agent reaches a node (by creation or by transmission), the code 
carried by the agent is installed. The code is in two parts, agent rules and
application (or algorithm) rules. Agent rules are the only applicable rules when
an agent is present, and, on the contrary, application rules are the only
applicable rules when it is not. Agent code execution is always finite and ends by a call to
the agent circulation code, responsible for transmitting the agent to another site. 
The ability of the agent to overwrite a corrupted application code (or simply to 
patch the code in a less severe fault model), motivates the inclusion of code in 
the agent. Because it is supposed to travel, an agent (code and briefcase) can be 
corrupted. 

In our agent model, there are two crucial properties, non-interference and
stabilization. The non-interference property expresses that if no failure
(corruption) occurred after the creation of the previous agent, then after the
destruction of this agent, the controlled application must not behave differently
whether or not a new agent is present in the network. The property implies that
the application cannot use the agents for solving its problem and that the agents
cannot perform resets to a predetermined global configuration. The stabilization
property states that if a failure occurred, after a finite number of agent
creations, the application behaves according to its specification.

Agent-stabilization has some advantages over classic self-stabilization. With 
self-stabilization, even if the stabilization time can be computed, it says nothing 
about the effective time needed for stabilization, which strongly depends on the 
communication delays and, consequently, on the global traffic in the network. On
the contrary, if agents are implemented in a way that guarantees small
communication delays (as high priority messages, for instance), an efficient bound on
agent-stabilization time can be given, because such a bound depends only on 
the time of the network traversal by the fast agent, plus the time between two 
successive agent creations, that can be tailored to the actual need. A second ad- 
vantage is that agent-stabilization is more powerful than self-stabilization. Any 
self-stabilizing solution is agent-stabilizing, with agents doing nothing, but some 
impossibility results from self-stabilization theory can be bypassed with agents, 
as we will show in the sequel. 
2 The Specification

The system is a set P of n communicating entities that we call processors, and
a relation neighbors ⊆ P × P. To each processor p in P are associated some
internal variables, an algorithm and a communication register r_p (r_p can have
a composite type). Each processor p ∈ P has read and write access to its register
r_p and read access to its neighbors' communication registers. The local state
of a processor is the set of the values of the internal variables and of its register.
A vector whose components are the local states of every processor of the system
is the configuration of the system.

The algorithm is defined as a set of guarded rules, of the form Guard_i →
Action_i, where Guard_i is a predicate on the local variables, the communication
register and the communication registers of the processor's neighbors, and
Action_i is a list of instructions which could modify the communication register
and/or the local variables of the processor. A rule is said to be enabled if its
guard is true and disabled if its guard is false.
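
A guarded rule of this form can be sketched in Python as follows. This is an illustration only: local variables and registers are modelled as dictionaries, the action is given read access to the neighbors' registers, and the names GuardedRule and enabled_rules are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

Vars = Dict[str, Any]  # hypothetical: named values of a register or local state


@dataclass
class GuardedRule:
    name: str
    # Guard: predicate on local variables, own register, neighbors' registers.
    guard: Callable[[Vars, Vars, List[Vars]], bool]
    # Action: may modify local variables and own register; neighbors are read-only.
    action: Callable[[Vars, Vars, List[Vars]], None]


def enabled_rules(rules: List[GuardedRule], local: Vars, register: Vars,
                  neighbor_registers: List[Vars]) -> List[GuardedRule]:
    """Return the rules whose guard is true in the current local view."""
    return [r for r in rules if r.guard(local, register, neighbor_registers)]
```

A processor performing a step would pick one element of enabled_rules(...) and run its action; how that choice is made is left open here.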

The system can evolve from one configuration to another by applying an atomic
step at some processor p. To define the computation rules of a processor, we
now need to look more precisely at what is held in the communication register.
It holds two parts: an algorithm-specific part, including all information that the
processor has to communicate to its neighbors, and an agent-specific part. The
agent-specific part of the communication register is composed of 6 fields: Agent
Code, Algorithm Code, Briefcase, Next, Prev and Present.

The fields Agent Code and Algorithm Code hold two finite sets of guarded
rules, depending on the algorithm; the Briefcase is a field of constant size
depending on the algorithm; the fields Next and Prev are either ⊥ or pointers to
neighbors of the processor (Prev points to the previous processor which had the
agent, and Next to the processor which will have it next); the field Present is a
boolean and indicates whether or not the agent is present at the processor.

An atomic step for a processor p is the execution by p of one of the following
rules:

Agent step: if Present_p is true, select and compute one of the enabled rules
held by the Agent Code part of the register; if Next_p is not ⊥, install the
agent in Next_p from p; set Present_p to false.

Algorithm step: if Present_p is false, select and compute one of the enabled
rules held by the Algorithm Code part of the register;

Agent installation: spontaneously install the agent in p from ⊥;

Installing an agent in p from q is performing the following algorithm:

1. Set Present_p to true;

2. Set Prev_p to q;

3. If q is ⊥, instantiate Algorithm Code with a fresh copy in Agent Code;

4. If q is not ⊥, copy the values of Briefcase_q, AlgorithmCode_q and
AgentCode_q from q to p.
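
The installation procedure and the agent step can be sketched in Python as below. This is a hedged illustration, not the paper's implementation: ⊥ is modelled by None, code fields are lists of opaque rule names, step 3 is read as resetting both code fields to pristine copies supplied by the outside, the rule that a spontaneous installation requires that no agent be present is checked explicitly, and the execution of the selected agent rule is left out. All identifiers are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

BOTTOM = None  # stands for ⊥


@dataclass
class Register:
    """Agent-specific part of a communication register (six fields)."""
    agent_code: List[str] = field(default_factory=list)
    algorithm_code: List[str] = field(default_factory=list)
    briefcase: Dict[str, Any] = field(default_factory=dict)
    next: Optional[int] = BOTTOM
    prev: Optional[int] = BOTTOM
    present: bool = False


def install_agent(regs: Dict[int, Register], p: int, q: Optional[int],
                  fresh_agent_code: List[str],
                  fresh_algorithm_code: List[str]) -> None:
    """Install the agent in p from q (q is BOTTOM for a creation by the outside)."""
    if q is BOTTOM and any(r.present for r in regs.values()):
        raise RuntimeError("an agent is already present in the system")
    rp = regs[p]
    rp.present = True                      # step 1
    rp.prev = q                            # step 2
    if q is BOTTOM:                        # step 3 (assumed reading)
        rp.agent_code = list(fresh_agent_code)
        rp.algorithm_code = list(fresh_algorithm_code)
    else:                                  # step 4
        rq = regs[q]
        rp.briefcase = dict(rq.briefcase)
        rp.algorithm_code = list(rq.algorithm_code)
        rp.agent_code = list(rq.agent_code)


def agent_step(regs: Dict[int, Register], p: int,
               fresh_agent_code: List[str],
               fresh_algorithm_code: List[str]) -> bool:
    """Forward the agent from p to Next_p; the agent rule itself is omitted."""
    rp = regs[p]
    if not rp.present:
        return False
    if rp.next is not BOTTOM:
        install_agent(regs, rp.next, p, fresh_agent_code, fresh_algorithm_code)
    rp.present = False
    return True
```

Note that in this sketch a corrupted Algorithm Code field at the initiator is overwritten at creation time, and the copy in step 4 then carries the clean code along the agent's path.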
An execution with initial configuration C is an alternating sequence of
configurations and atomic steps, each configuration following an atomic step being
obtained by the computation of the atomic step from the previous configuration,
C being the first configuration of the sequence. We will consider here only fair
executions. A fair execution is an execution such that if a guard of a processor p
is continuously enabled, this processor will eventually perform an atomic step.
System executions satisfy two axioms that we describe here:

Uniqueness of creation. The atomic step Agent Installation can be performed
only on a configuration where for all processors p, Present_p is false;

Liveness of creation. In each execution there are infinitely many configurations
in which an agent is present;

For the sake of clarity, we define a special alternating sequence of
configurations and atomic steps, which is not an execution with respect to the
two axioms, but that we call an agent-free execution and which is used to define
the agent-stabilizing notion below. An agent-free execution is an alternating
sequence of configurations and atomic steps, such that the initial configuration
of the sequence satisfies ¬Present_p for all p and only algorithm steps are
computed in this sequence. The set of agent-free executions starting at L (where
L is a set of configurations) is the collection of agent-free executions such that
the initial configuration is in L. We also define the agent-free projection of an
execution: it is the alternating sequence of configurations and atomic steps
where the agent part of the registers is masked and agent steps are ignored.

Finally, we define the type of failures that we will consider by defining initial
configurations. There is no constraint on an initial configuration, except the fact
that agent circulation rules must be correct (non-corruptible).

Let S be a specification (that is, a predicate on executions). A system is
agent-stabilizing for S iff there exists a subset ℒ of the set of configurations
satisfying:

Convergence. For every configuration C, for every execution starting at C,
the system reaches a configuration L in ℒ;

Correctness. For every configuration L in ℒ, every execution with L as initial
configuration satisfies the specification S;

Independence. For every configuration L in ℒ, every agent-free execution with
L as initial configuration satisfies the specification S;

Non-interference. For every configuration L in ℒ, the agent-free projection of
every execution with L as initial configuration satisfies the specification S;

Finite time to live. Every execution has the property that, after a bounded
number of agent-steps, an agent is always removed from the system.

The convergence and correctness parts of this definition meet the traditional
definition of a self-stabilizing algorithm. The independence part ensures that the
algorithm will not rely on the agent to give a service: e.g. an agent-stabilizing
system yielding a token circulation cannot define the token as the agent. Thus, an
agent is indeed a tool to gain convergence, not to ensure correctness. Moreover,
the non-interference property enforces the independence property by ensuring
that an agent cannot perturb the behavior of a fault-free system. Finally, the
finite time to live property ensures that an agent will not stay in the system
forever and that an agent is a sporadic tool, used only from time to time.

3 General Properties of Agent-Stabilizing Systems 

The study of the convergence property of an agent-stabilizing system is as diffi- 
cult as the convergence property of a self-stabilizing system, because executions 
can start from any initial configuration. Moreover, the traditional fault model 
of self-stabilization does not include the corruption of the code. Here, we accept 
some code corruption (algorithm code corruption), making the study still harder. 

We will prove in this part that without loss of generality, for proving the 
convergence property of an agent-stabilizing system, the set of initial configura- 
tions can be restricted to those in which code is correct and there is exactly one 
agent, just “correctly” installed at some processor p. 

Theorem 1. Assume the finite time to live property. Then, for proving
convergence of an agent-stabilizing system, it is sufficient to consider as
initial configurations those following the installation of a unique agent.

Sketch of Proof. Let C be an arbitrary configuration with n processors and p
agents at processors P_1, ..., P_p (thus, Present_i is true for i ∈ {P_j, 1 ≤ j ≤ p}).
Let ℰ = C s_1 C_1 ... be an arbitrary execution with initial configuration C. ℰ is
fair, thus there are infinitely many agent steps in the execution ℰ. The property
"finite time to live" is assumed, thus the execution necessarily reaches a
configuration in which an agent is removed from the system, leading to a
configuration C' with one agent less than in C.

The axiom "uniqueness of creation" implies that while there is at least one
agent in the system (thus one processor verifying "Present_p is true"), the agent
installation step cannot be performed on any other processor in the system. By
recursion on the number of agents in the system, the system reaches a
configuration C^p where there are no agents in the system (potentially, C^p = C
if there was no agent in the initial configuration).

The execution ℰ also satisfies the axiom "liveness of creation"; there is a
finite number of configurations between C and C^p, thus at least one agent will be
present in a configuration A after C^p. The only atomic step which could install
an agent is the "Agent installation" step, thus there is always a configuration A
reachable from C where a single agent is just installed in the system. □

4 Agent Circulation Rules 

One of the main goals of the agent is to traverse the network. We will present 
here three algorithms for achieving a complete graph traversal. Two of them 
(depth first circulation and switch algorithm) hold for arbitrary graphs, whereas 
the third is an anonymous ring traversal. We will also show that, in any agent-
system using one of these algorithms as agent circulation rules, it is easy to
implement the finite time to live property. 

Depth First Circulation. 

The algorithm assumes that every processor of the system is tagged with a color 
black or white. Assume that an agent has just been created. The briefcase carries
the direction of the traversal (Up or Down). The processor visited by the agent 
for the first time (with a Down Briefcase) flips its color, chooses as a parent 
the previous processor, chooses a neighbor tagged with a color different from its 
new color and sends the agent to this neighbor, with the briefcase Down. When 
the agent reaches a dead end (i.e. is sent down to a processor with no neighbors 
with the right color), it is sent back by this processor with Up in the Briefcase. 
When a processor receives an agent with Up in the Briefcase, if it is a dead end, 
it sends the agent Up, if not, it selects a new neighbor to color and sends the 
agent down. 

Moreover, the briefcase holds a hop counter, namely the number of steps since
the installation. This counter is incremented each time the agent moves. If the
agent makes more than $2|E|$ steps (a bound greater than 2 times the number of
edges of the depth first spanning tree), the agent is destroyed (next processor is
$\bot$). More formally, the description of the traversal algorithm is given in Figure 1.
For this, we define:

$$\theta(\Gamma_i, c) = \begin{cases} \bot & \text{if } \forall p \in \Gamma_i,\ \mathit{color}_p \neq c \\ p \text{ s.t. } \mathit{color}_p = c \text{ and } p \in \Gamma_i & \text{else} \end{cases} \qquad \text{and also} \qquad \overline{\mathit{black}} = \mathit{white},\quad \overline{\mathit{white}} = \mathit{black}$$

We claim that this algorithm produces a complete traversal of the system
even in the case of transient failures because, as we will show in Theorem 3, at
each execution of rule 1, the number of processors of the same color strictly
increases, until reaching the total number of processors in the system.

Theorem 2. Every fair agent-system using the depth first circulation algorithm
of Figure 1 as agent circulation algorithm satisfies the finite time to live property.

The proof is straightforward and uses the fairness of the system for showing 
that every agent increments its hop counter until the bound is reached. 

Note that the bound is large enough to allow a complete depth first traversal
of the system: it is 2 times the number of edges, and in a depth first traversal,
each edge of the spanning tree is used 2 times (once for going down to the
leaves, once for going up to the root). We will now prove that this algorithm
produces a complete graph traversal when there is a single agent in the system.
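
The circulation rules described above can be sketched as a small simulation. This is only an illustration under stated assumptions: a single agent, an undirected graph given as a Python adjacency dictionary, and colors encoded as 0 and 1. The function name `depth_first_agent` and these encodings are hypothetical, not part of the paper. The sketch follows the prose description (a newly visited processor flips its color and looks for a neighbor whose color differs from its new color) and uses the hop bound $2|E|$ of the constant rule.

```python
def depth_first_agent(adj, color, start):
    """Simulate one agent installed at `start`; mutates `color`.

    Returns (processors in order of first visit, number of agent moves).
    """
    num_edges = sum(len(ns) for ns in adj.values()) // 2
    bound = 2 * num_edges              # hop bound of the constant rule
    parent, visited = {}, []
    node, prev, direction, hop = start, None, "down", 0
    while node is not None:
        if hop > bound:                # constant rule: the agent is destroyed
            break
        if direction == "down":        # first visit: flip color, record parent
            old = color[node]
            color[node] = 1 - old
            parent[node] = prev
            visited.append(node)
            candidates = [q for q in adj[node] if color[q] == old]
            if candidates:             # keep going down
                nxt, direction = candidates[0], "down"
            else:                      # dead end: send the agent back up
                nxt, direction = prev, "up"
        else:                          # agent came back up to this processor
            candidates = [q for q in adj[node] if color[q] != color[node]]
            if candidates:             # another branch remains
                nxt, direction = candidates[0], "down"
            else:                      # done here; parent is None at the initiator
                nxt, direction = parent[node], "up"
        prev, node = node, nxt
        if node is not None:
            hop += 1
    return visited, hop
```

On a connected graph in which all processors initially share one color, the agent uses each edge of the depth first spanning tree twice, that is $2(n-1) \le 2|E|$ moves, so the bound is never reached. When the colors are mixed, the agent only covers the same-colored region around the initiator and flips it, which is the situation handled by the lemma and theorem that follow.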

Let $p, q \in P$. We say that $p \sim q$ iff $\mathit{color}_p = \mathit{color}_q$ and $(p, q) \in \mathit{neighbours}$.
In the sequel, the term “connected components” refers to the connected
components of the graph of the relation $\sim$.
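
To make the relation concrete, the following sketch computes these connected components for a given coloring. The adjacency-dictionary encoding and the name `same_color_components` are assumptions made for the illustration only.

```python
def same_color_components(adj, color):
    """Connected components of the relation: neighbors with equal colors."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], set()
        while stack:
            p = stack.pop()
            comp.add(p)
            for q in adj[p]:
                # q is related to p only if it is a neighbor of the same color
                if q not in seen and color[q] == color[p]:
                    seen.add(q)
                    stack.append(q)
        components.append(comp)
    return components
```

For instance, with edges 0-1, 0-2, 1-2, 1-3 and processors 0 and 1 of one color and 2 and 3 of the other, the components are {0, 1}, {2} and {3}: processors 2 and 3 have the same color but are not neighbors.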

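Lemma 1. Let $C$ be a configuration with $m$ connected components ($M_1, \ldots, M_m$)
and let $p$ be a processor with an agent just correctly installed. Let $M_j$ be the
connected component of $p$.

In every execution $\mathcal{E} = C\, s_1 \ldots C_1 \ldots$ there is a configuration $C_1$ such that the
agent is in $p$, has visited all processors in $M_j$ and every processor in $M_j$ has
flipped its color.

Local: $\mathit{parent}_i \in \Gamma_i \cup \{\bot\}$, $\mathit{color}_i \in \{\mathit{black}, \mathit{white}\}$

Briefcase: $\mathit{Briefcase}_i \in (\{\mathit{down}, \mathit{up}\}, \mathit{hop})$, $1 \le \mathit{hop} \le 2|E|$

Constant rule: At each rule, we add the following statement:
if $\mathit{hop} > 2|E|$ then $\mathit{Next}_i \leftarrow \bot$.

Agent rules:

1. $\mathit{Prev}_i = \bot \lor ((\mathit{Briefcase}_i = (\mathit{down}, \mathit{hop})) \land (\theta(\Gamma_i, \mathit{color}_{\mathit{Prev}_i}) \neq \bot)) \longrightarrow$
      $\mathit{parent}_i \leftarrow \mathit{Prev}_i$
      $\mathit{Next}_i \leftarrow \theta(\Gamma_i, \mathit{color}_i)$
      $\mathit{color}_i \leftarrow \overline{\mathit{color}_i}$
      if $\mathit{Prev}_i = \bot$, $\mathit{hop} \leftarrow 0$
      $\mathit{Briefcase}_i \leftarrow (\mathit{down}, \mathit{hop} + 1)$
   (* going down *)

2. $(\mathit{Prev}_i \neq \bot) \land (\mathit{Briefcase}_i = (\mathit{down}, \mathit{hop})) \land (\theta(\Gamma_i, \mathit{color}_{\mathit{Prev}_i}) = \bot) \longrightarrow$
      $\mathit{parent}_i \leftarrow \mathit{Prev}_i$
      $\mathit{Next}_i \leftarrow \mathit{Prev}_i$
      $\mathit{Briefcase}_i \leftarrow (\mathit{up}, \mathit{hop} + 1)$
      $\mathit{color}_i \leftarrow \overline{\mathit{color}_i}$
   (* reaching a dead-end *)

3. $(\mathit{Prev}_i \neq \bot) \land (\mathit{Briefcase}_i = (\mathit{up}, \mathit{hop})) \land (\theta(\Gamma_i, \mathit{color}_i) \neq \bot) \longrightarrow$
      $\mathit{Next}_i \leftarrow \theta(\Gamma_i, \mathit{color}_i)$
      $\mathit{Briefcase}_i \leftarrow (\mathit{down}, \mathit{hop} + 1)$
   (* done one branch, go down for others *)

4. $(\mathit{Prev}_i \neq \bot) \land (\mathit{Briefcase}_i = (\mathit{up}, \mathit{hop})) \land (\theta(\Gamma_i, \mathit{color}_i) = \bot) \longrightarrow$
      $\mathit{Next}_i \leftarrow \mathit{parent}_i$
      $\mathit{Briefcase}_i \leftarrow (\mathit{up}, \mathit{hop} + 1)$
   (* completely done one node, going up *)

Fig. 1. Depth first agent traversal algorithm for anonymous networks.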

Theorem 3. Consider the same configuration $C$ as in Lemma 1.

In every execution $\mathcal{E} = C\, s_1 \ldots$ there is a configuration such that there
is no agent, one agent has visited all processors of the system and all processors
have the same color.

Sketch of Proof. Lemma 1 states that from an initial configuration $C$, the system
reaches a configuration $C_1$, with all processors of $M_j$ (the connected component
of $p$) visited and having flipped their color. If there is only one connected
component in $C$, then the theorem is true (rule 4 sets Next to $\bot$ in $p$ and when $p$
makes an atomic step, it removes the agent). If there are $m > 1$ connected
components in $C$, then there are $1 \le n < m$ connected components in $C_1$ (because
all processors of $M_j$ flipped their color, thus merged with the neighboring connected
components). Thus, Lemma 1 applies again in a configuration with the agent
removed from $p$ (rule 4) and installed in a processor $p'$ (liveness of the agent).

Induction on the number of connected components in $C$ proves the theorem. □

The Switch Algorithm. The switch algorithm is another anonymous graph
traversal algorithm. As originally presented, the algorithm is defined for semi-uniform
Eulerian networks, but as we will show now, it works for uniform Eulerian
networks under the agent-stabilizing system assumptions. This algorithm is an
improvement over the depth first circulation because here circulation is decided purely
locally, without communicating with the neighbors of the processor which holds
the agent. Moreover, in the depth first token circulation, the agent performs a
complete graph traversal only after all connected components have merged into one.
Here, the algorithm converges with only one agent installation. This has a price:
an agent circulation round takes $2 \times n \cdot |E|$ agent steps, even after stabilization,
while an agent circulation round with the depth first circulation takes only $2|E'|$
agent steps ($E'$ being the set of covering edges of the depth first tree).

The idea of the algorithm is the following. Suppose that each node has ordered
its (outgoing) edges and let PointsTo be a pointer to one of these edges. When
the token is at some node, it is passed to the neighbor pointed to by PointsTo,
then PointsTo is set to the next edge in the list (which is managed circularly).

In [12] it is proved that, starting with a unique token, this algorithm
eventually performs an Eulerian traversal of the network.

We propose here a slight modification of this algorithm so that it performs a
complete traversal of an anonymous Eulerian agent-system. The initiator (with
$\mathit{Prev} = \bot$) will set the Briefcase to 0 and every time the agent performs an agent
step, Briefcase will be incremented by 1. If a processor holds the agent and if
Briefcase is greater than $2 \times n \cdot |E|$ (where $E$ is the set of edges of the system and
$n$ is the number of processors in the system), then this processor sets Next to
$\bot$, destroying the agent. The agent rules are presented more formally in Figure 2.
We will now prove that this circulation algorithm satisfies the finite time to live
property.

Theorem 4. Every fair agent-system using the switch circulation algorithm of
Figure 2 as an agent circulation algorithm satisfies the finite time to live
property.

The proof is straightforward and uses the fairness of the system for showing 
that the Briefcase counter is incremented until reaching the bound. 

Theorem 5. The switch circulation algorithm is a circulation algorithm, which 
performs a complete graph traversal from a configuration in which an agent in- 
stallation follows a configuration without agents. 

We use the convergence property of the switch, proven in [12], and the
convergence time, which is $n \cdot |E|$, to state that after $n \cdot |E|$ agent steps, every edge,
thus every node of the system, has been visited by the agent.
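
The switch rule with its Briefcase bound can be sketched as follows. This is a minimal simulation, assuming an undirected graph whose nodes all have even degree, stored as ordered neighbor lists, and an arbitrary initial pointer per node. The names `switch_circulation`, `neighbors` and `points_to` are hypothetical. Following the agent rules, the pointer is incremented first and the agent is then sent along the edge it designates.

```python
def switch_circulation(neighbors, points_to, start):
    """Simulate one agent installed at `start`; mutates `points_to`.

    Returns (set of visited nodes, number of agent moves).
    """
    n = len(neighbors)
    num_edges = sum(len(ns) for ns in neighbors.values()) // 2
    bound = 2 * n * num_edges          # time-to-live bound on Briefcase
    visited = {start}
    node, briefcase, moves = start, None, 0   # None: agent just installed
    while briefcase is None or briefcase <= bound:
        # Inc(PointsTo): advance the pointer circularly, then follow it
        points_to[node] = (points_to[node] + 1) % len(neighbors[node])
        node = neighbors[node][points_to[node]]
        # the initiator resets the counter, every later step increments it
        briefcase = 0 if briefcase is None else briefcase + 1
        moves += 1
        visited.add(node)
    return visited, moves              # Briefcase exceeded the bound: agent destroyed
```

On a five-node graph made of two triangles sharing node 0 (so $n = 5$ and $|E| = 6$), the bound is 60 and the agent makes 62 moves before being destroyed, far more than the handful of moves it needs to reach every node. This illustrates the price mentioned earlier: the decision is purely local, but the agent lives much longer than a depth first agent would.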
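Local: $\mathit{PointsTo}_i \in \Gamma_i$

Briefcase: $0 \le \mathit{Briefcase} \le 2 \times n \cdot |E| + 1$

Operator: $\mathit{Inc}(\mathit{PointsTo}_i)$ increments $\mathit{PointsTo}_i$ by 1 modulo $\mathit{Degree}_i$.

Agent rules:

1. $(\mathit{Prev} = \bot) \longrightarrow$
      $\mathit{Inc}(\mathit{PointsTo}_i)$
      $\mathit{Next}_i = \mathit{PointsTo}_i$
      $\mathit{Briefcase}_i = 0$

2. $(\mathit{Prev} \neq \bot) \land (\mathit{Briefcase} \le 2 \times n \cdot |E|) \longrightarrow$
      $\mathit{Inc}(\mathit{PointsTo}_i)$
      $\mathit{Next}_i = \mathit{PointsTo}_i$
      $\mathit{Briefcase}_i = \mathit{Briefcase}_i + 1$

3. $(\mathit{Prev} \neq \bot) \land (\mathit{Briefcase} > 2 \times n \cdot |E|) \longrightarrow$
      $\mathit{Next}_i = \bot$

Fig. 2. Switch circulation algorithm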



Anonymous Oriented Ring Circulation. 

In such a topology circulation is trivial : the agent rule consists of sending the 
agent to the successor and counting in the Briefcase the number of processors 
already seen. If this number is greater than or equal to the number of processors 
in the system, the agent is destroyed. With this algorithm, the finite time to 
live property is obvious, as is the correctness property of the agent circulation : 
starting from a configuration with only one agent, just installed at some proces- 
sor p, this agent visits all the processors in the system. Furthermore, every new 
agent is eventually destroyed by its initiator. 
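
As a small illustration of this rule, the sketch below numbers the processors of the oriented ring from 0 to $n-1$ and takes the successor of $i$ to be $(i + 1) \bmod n$. This numbering exists only in the simulation (the processors themselves stay anonymous), and the name `ring_circulation` is hypothetical.

```python
def ring_circulation(n, start):
    """Send the agent around an oriented ring of n processors.

    Returns (processors in visiting order, processor that destroys the agent).
    """
    visited = []
    node, seen = start, 0      # `seen` plays the role of the Briefcase counter
    while seen < n:
        visited.append(node)
        seen += 1
        node = (node + 1) % n  # forward the agent to the successor
    return visited, node       # counter reached n: the agent is destroyed here
```

After exactly $n$ moves the agent is back at the processor where it was installed, which is why the initiator is the one that destroys it.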



5 Examples Illustrating the Power of Agents 

We will first present general transformations which, given a self-stabilizing
solution for semi-uniform networks or for networks with distinct identifiers,
automatically transform it into an agent-stabilizing solution for the same problem,
but for completely anonymous networks (uniform networks). The basic idea is
that a generic agent algorithm, with an empty set of algorithm rules, either
distinguishes a particular processor or gives distinct identifiers in an anonymous
network. By adding the rules of the self-stabilizing solution as algorithm rules,
one gets an agent-stabilizing solution for anonymous networks.

Let us briefly describe the ideas of the transformation from uniform to
semi-uniform. Note that an algorithm that always declares the processor in which the
agent is created as the leader does not satisfy the non-interference property.
Thus, the agent must perform a complete traversal (using the circulation rules)
to check whether or not there is exactly one distinguished processor. If several
distinguished processors (resulting from a corruption) are found, all but the first
are canceled, and if none is found, the initiator becomes distinguished.
Note that there is no algorithm rule in this solution, so it can be combined
with any self-stabilizing solution for semi-uniform networks. The main
advantage of automatic transformations is that they are automatic. But they very
seldom yield the most efficient solution. In the second part, we will give another
illustration of the power of agents, by giving efficient specific solutions to some
classical problems.

5.1 General Transformers 

Leader Election within a Uniform Network. This transformer allows a 
self-stabilizing system designed for semi-uniform networks to work for uniform 
networks. This is achieved by solving the Leader election problem. 

Starting at the initiator, the agent performs a complete graph traversal
using the algorithm in Figure 1. We proved that this algorithm eventually performs
a complete graph traversal, and it is obvious that after the first traversal, no
processor has $\mathit{Prev}_p = \bot$. Then we can ensure that there is no processor with
$\mathit{parent}_p = \bot$ after the first agent traversal, by adding $\mathit{parent}_p = \mathit{Prev}$ to the
action part of rule 4 (the rule which is used when every neighbor of this node
has been visited).

The algorithm is simple: the agent carries in the briefcase the information
“did I meet a leader yet” (boolean value). If this information is true when the
agent meets a leader, that leader is canceled; if the agent meets the initiator
(which is the only processor which has $\mathit{parent}_i = \bot$ for every agent installation
after the first complete traversal) and this information is false, this processor
becomes the leader.
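
The briefcase logic can be sketched independently of the traversal. In the illustration below, `order` is assumed to be the sequence in which the agent finishes the processors (the moments at which rule 4 fires), so the initiator comes last. The function name `elect_leader` and the dictionary encoding of the leader flags are hypothetical.

```python
def elect_leader(is_leader, order):
    """Apply the agent's leader-election actions along one complete traversal.

    `is_leader` maps each processor to a boolean and is updated in place.
    """
    met_leader = False                 # briefcase: "did I meet a leader yet"
    for p in order:
        if is_leader[p]:
            if met_leader:
                is_leader[p] = False   # supplementary leader is canceled
            else:
                met_leader = True      # first leader met is kept
    initiator = order[-1]
    if not met_leader:
        is_leader[initiator] = True    # no leader found: initiator takes the role
    return is_leader
```

Starting from a configuration with exactly one leader, the function changes nothing, which matches the non-interference requirement; with several leaders or none, exactly one remains after the traversal.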

Lemma 2. Consider a configuration C of Depth first agent traversal, in which 
all processors have the same color i and an agent has just been installed at pro- 
cessor p. The agent performs a complete graph traversal and rule 4 is executed 
exactly once by each processor before the agent is removed from the system. 

Theorem 6. The leader election algorithm is agent-stabilizing.

Sketch of Proof. The agent has a finite time to live, as was proven in Theorem 2.
It is then legitimate to use the result of Theorem 1 and to consider for the
convergence property a configuration with one agent in the system. This agent
will eventually traverse every node and come back to the initiator (Theorem 3).
According to the rules of the algorithm, there will remain one and only one leader
in the system, thus convergence is verified. Consider a legitimate configuration. The
algorithm rules are empty (thus the agent-system satisfies independence), and
when an agent is installed, it will perform a complete graph traversal, visiting
every node and executing rule 4 at each node exactly once (Lemma 2 applies:
the configuration is legitimate, thus every node is of the same color). Thus, it
will visit the leader and note it in the briefcase. Then, the initiator will not be
designated as the leader and the correctness property is satisfied. Moreover, the
leader stays the leader, thus the non-interference property is also verified. □






Network Naming. The naming problem is to assign to each processor of the 
network a unique identity. We consider that there is a set of identifiers (at least 
as large as the network size), and that identities should be chosen within this 
set. The agent carries in the briefcase a boolean tag for each element of this set. 
When the agent reaches a processor p, if p uses an identifier already used, its 
identifier is set to the first available identifier, and this identifier is tagged in the 
briefcase ; if p uses an identifier which is not yet used, this identifier is tagged 
in the briefcase. The agent performs a complete graph traversal according to
the algorithm of Figure 1, and actions are executed when the guard of rule 4 is true.
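
The renaming action can be sketched as follows, again abstracting the traversal into the order in which processors are handled. The names `assign_names`, `name` and `identifiers` are hypothetical; the sketch assumes every current name belongs to `identifiers` and that there are at least as many identifiers as processors.

```python
def assign_names(name, order, identifiers):
    """Apply the agent's naming actions along one complete traversal.

    `name` maps each processor to its current identifier (updated in place).
    """
    used = set()                       # briefcase: identifiers already tagged
    for p in order:
        if name[p] in used:
            # identifier already seen: take the first one not yet tagged
            name[p] = next(i for i in identifiers if i not in used)
        used.add(name[p])              # tag the identifier now held by p
    return name
```

If all names are already distinct, no processor is renamed. Otherwise a processor may receive an identifier that a processor visited later still holds; that later processor is then renamed in turn, so the names are distinct once the traversal is complete.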

Theorem 7. The network naming algorithm is agent-stabilizing.

Sketch of Proof. As said before, the finite time to live property is
inherited from the agent circulation algorithm. As for leader election, independence
is obvious since the algorithm has no rule. For proving the correctness property,
we have to show that the set of legitimate configurations is closed. Let us define
legitimate configurations as configurations such that $\forall p, q \in P,\ p \neq q \Rightarrow \mathit{tag}_p \neq \mathit{tag}_q$. In a
token circulation, rule 4 of the depth first traversal applies at most once at each processor (Lemma 2).
Thus, the agent will never meet a processor with a tag equal to an already used
tag. Thus, the renaming action will never be applied and the configuration will not change
with respect to the tags. Let us consider an arbitrary configuration $C$ with an
agent just installed at some processor. If this configuration is not legitimate, then
there exist a processor $p$ and a processor $q$ tagged by the same name. Theorem 3
states that in every execution with initial configuration $C$, there is a
configuration $C''$ reached after the agent has visited every node. When this agent was
installed, Briefcase was set to nil; it first reaches either $p$ or $q$. If it reaches $p$ first,
$\mathit{tag}_p$ is an element of Briefcase, and when it reaches $q$, $\mathit{tag}_q$ is set to a not yet
used tag. □

5.2 Local Mutual Exclusion within a Uniform Network 

With the previous algorithm, we can easily transform every self-stabilizing algo- 
rithm designed for networks with identities to an agent-stabilizing solution for 
anonymous networks. Note that we assume that the agent can know the different
Ids already seen in the network (the briefcase must have $n \log(n)$ bits). Obviously, in
specific cases, better solutions can be found. We now propose a solution for
solving Local Mutual Exclusion within a uniform anonymous network. Local Mutual
Exclusion is a generalization of the dining philosophers problem [7], and can be
defined as “having a notion of privilege on processors, two neighbors cannot be
privileged at the same time”.

The solution is based on the self-stabilizing local mutual exclusion algorithm
proposed in [?]. Every processor has a variable $b$ in range $[0; n^2 - 1]$. Assuming
that neighbors have different $b$ values, a cyclic comparison modulo $n^2$ of
the $b$ values yields an acyclic orientation of edges (from greater to lesser). If we
consider that every sink (i.e. a node with only ingoing edges) is privileged, then
two neighbors can never be privileged simultaneously. To pass the privilege, $b$ is
set to the maximum of the neighbors' values plus one (modulo $n^2$), and in [?] it is
shown that this rule provides a cyclic selection of every processor in the system.

It is also shown that this solution relies on the fact that a processor
has a value different from the values of its neighbors, and identifiers are used
to implement that. An agent can easily be used in an anonymous network for
creating asymmetry between a processor and its neighbors.
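
The interplay between the privilege rule and the agent can be illustrated with a short sketch. For simplicity it compares $b$ values directly and ignores the cyclic (wraparound) comparison, which is a simplifying assumption of this example; the helper names `privileged`, `pass_privilege` and `agent_repair` are hypothetical.

```python
def privileged(adj, b):
    """Sinks of the orientation: processors smaller than all their neighbors."""
    return {p for p in adj if all(b[p] < b[q] for q in adj[p])}

def pass_privilege(adj, b, p, modulus):
    """Algorithm rule: p releases the privilege by jumping above its neighbors."""
    b[p] = (max(b[q] for q in adj[p]) + 1) % modulus

def agent_repair(adj, b, p, modulus):
    """Agent rule: break a tie between p and one of its neighbors.

    Returns True if the value of p was changed.
    """
    for q in adj[p]:
        if b[q] == b[p]:
            b[p] = (b[q] + 1) % modulus
            return True
    return False
```

On the path 0-1-2 with values 0, 1, 0 and modulus $n^2 = 9$, processors 0 and 2 are privileged and they are not neighbors. After processor 0 passes the privilege its value becomes 2 and only processor 2 remains privileged. If corruption produces values 3, 3, 5, the agent visiting processor 1 raises it to 4 and the tie disappears.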



Algorithm rules

1. $\forall q \in \Gamma_p,\ b < b|_q \longrightarrow b \leftarrow \max_{q \in \Gamma_p}(b|_q + 1)$

Agent rules

1. Every rule of Figure 1, composed with

2. $\exists q \in \Gamma_p \text{ s.t. } b|_q = b \longrightarrow b \leftarrow (b|_q + 1)$

Fig. 3. Anonymous uniform Local mutual exclusion agent-stabilizing algorithm.



Theorem 8. The local mutual exclusion algorithm of Figure 3 is agent-stabilizing.

Sketch of Proof. This algorithm uses the depth first circulation algorithm, thus
the finite time to live property is satisfied. Then, according to Theorem 1, we can
consider for proving convergence a configuration with a single agent just installed
at some node. Let $C$ be such a configuration. Either $\forall p \in P, \forall q \in \Gamma_p,\ b|_p \neq b|_q$,
and it was proven in [?] that the algorithm converges, or $\exists p \in P, q \in \Gamma_p,\ b|_p = b|_q$.
In this case, the agent will eventually visit $p$ and $q$ (Theorem 3), either $p$ first or
$q$ first. If the first visited processor is $p$ (resp. $q$), agent rule 2 will be enabled,
thus the system will reach a configuration in which $\exists p \in P, q \in \Gamma_p,\ b|_p = b|_q$ is
false. That will remain false in the sequel of the execution. Then convergence is
proven.

Starting from a legitimate configuration, the agent does not modify the $b$ value
of any processor (a legitimate configuration satisfies $\forall p \in P, \forall q \in \Gamma_p,\ b|_p \neq b|_q$).
Correctness of the algorithm has been proven in [?]. Thus every execution from
a legitimate initial configuration is correct. In such an execution, the agent does
not modify the behavior of the algorithm, so non-interference is satisfied.
Finally, the independence property is also satisfied since every agent-free execution
with a legitimate initial configuration is correct. □

5.3 Token Circulation on a Bidirectional Uniform Oriented Ring 

As was shown in [?], a simple non self-stabilizing algorithm for achieving token
circulation on a ring with a leader can easily be made agent-stabilizing, assuming
that the agent is always installed at the leader. We will prove that this
assumption can be removed. Furthermore, we propose a token circulation algorithm on
a bidirectional uniform oriented ring which is agent-stabilizing.






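Predicates:

   $\mathit{sensitive}(p) \equiv b|_{p+1} = b|_p \neq b|_{p-1}$
   $\mathit{HasToken}(p) \equiv b|_{p-1} = 1 \land b|_p = 0$
   $\mathit{Context}(p) = \mathit{Tokens}$ if $\mathit{HasToken}(p)$, $\mathit{NoTokens}$ else

Algorithm rules

   (a) $\mathit{sensitive}(p) \land b|_p = 1 \longrightarrow b|_p = 0$
   (b) $\mathit{sensitive}(p) \land b|_p = 0 \longrightarrow b|_p = 1$

Patch 1 Rules

   $\mathit{color}|_p \neq \mathit{Seen} \land \mathit{sensitive}(p) \land b|_p = 1 \longrightarrow b|_p = 0$
   $\mathit{color}|_p \neq \mathit{Seen} \land \mathit{sensitive}(p) \land b|_p = 0 \longrightarrow b|_p = 1,\ \mathit{color}|_p = \mathit{Seen}$

Patch 2 Rules: $\emptyset$

Agent rules (rules of ring circulation, in conjunction with):

1. $\mathit{Prev} = \bot \longrightarrow$
      $\mathit{color}|_p = \mathit{NotSeen}$, $\mathit{Briefcase.context} = \mathit{Context}(p)$, install Patch 1.

2. $\mathit{Prev} \neq \bot \land \mathit{Briefcase.hop} = n \longrightarrow$
      if $\mathit{Briefcase.context} = \mathit{NoTokens}$ and $\mathit{color}|_p = \mathit{NotSeen}$,
         create a new token
      elsif $\mathit{Briefcase.context} = \mathit{Tokens}$ and $\mathit{color}|_p = \mathit{Seen}$,
         install Patch 2, $\mathit{Briefcase.hop} = 0$
      else install Algorithm.

3. $\mathit{Prev} \neq \bot \land \mathit{Briefcase.context} = \mathit{Tokens} \longrightarrow$
      if $\mathit{HasToken}(p)$ destroy the token

4. $\mathit{Prev} \neq \bot \land \mathit{Briefcase.context} = \mathit{NoTokens} \longrightarrow$
      $\mathit{Briefcase.context} = \mathit{Context}(p)$

Fig. 4. Anonymous uniform token circulation agent-stabilizing algorithm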



The idea is the following: let $b$ be a boolean variable stored at each node.
Let us say that processor $p$ has the token in configuration $C$ if in $C$, $b|_p = 0$
and $b|_{p-1} = 1$ (the ring is oriented). Let us say that a processor is sensitive if it
has the same $b$ value as its successor and a $b$ value different from its predecessor
(thus the token holder may or may not be sensitive). The algorithm performs
the following rule: if a processor $p$ is sensitive, it is allowed to flip its bit. Let us
denote a configuration by a word, like 01...10, where a digit represents the value
of the $b$ bit of a processor. A legitimate configuration is of the form $1^p 0^{n-p}$.
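
These definitions can be checked on small rings with the sketch below. A configuration is represented as a Python list of bits indexed from 0 to $n-1$, with indices taken modulo $n$; this numbering and the helper names are assumptions of the illustration only.

```python
def sensitive(b, p):
    """Same bit as the successor, different bit from the predecessor."""
    n = len(b)
    return b[(p + 1) % n] == b[p] != b[(p - 1) % n]

def has_token(b, p):
    """Processor p holds a token if its predecessor has 1 and it has 0."""
    return b[(p - 1) % len(b)] == 1 and b[p] == 0

def tokens(b):
    """Positions of all tokens in configuration b."""
    return [p for p in range(len(b)) if has_token(b, p)]

def successors(b):
    """Configurations reachable in one step: one sensitive processor flips."""
    result = []
    for p in range(len(b)):
        if sensitive(b, p):
            nxt = list(b)
            nxt[p] = 1 - nxt[p]
            result.append(nxt)
    return result
```

For the legitimate configuration 11000, two processors are sensitive and both successor configurations (01000 and 11100) still contain exactly one token. In contrast, 1010 contains two tokens and 0000 contains none and has no sensitive processor, so the algorithm alone cannot recover from either of them.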

This algorithm is not self-stabilizing. The solution that we propose here to 
stabilize uses agents. The first idea is simple : an agent traverses the ring in the 
same direction as the tokens, carrying in the briefcase the information “did I see 
a token until now” . If so, every other token can be destroyed by setting the b bit 
to 1. When the agent comes back to the Initiator, if there is at least one token 
in the system (Briefcase = Tokens), then the agent simply disappears, if not a 
token is created at the Initiator (flipping the bit) and then the agent disappears. 

The circulation algorithm used here is the oriented anonymous ring
circulation. The agent detects that it has performed a complete traversal by checking
whether Briefcase has the value $n$ when the agent is back at the Initiator. This
solution works only if tokens do not move. The next problem to solve is then to
ensure that the agent will meet the supplementary tokens, if any. For that, the
initiator receives a patch from the agent: instead of letting every token pass
through it while the agent is inside the ring, it allows only one token to pass
and freezes the others.

The last problem is the following: if the agent saw one or more tokens while another
token passed through the initiator, then when the agent comes back to the initiator
there will be two tokens in the system. But this situation can be detected by the
initiator. One solution could be not to let the agent overtake a token, another
not to let a token pass through the initiator. But both would violate the
non-interference requirement. A solution matching the requirements is to give the
initiator the responsibility to destroy the supplementary token when it reaches
it. Note that, if a bounded convergence time is mandatory (related to the speed
of the agent), it could be better to let the initiator freeze one token while the
agent performs a second round for deleting the other. The solution is presented
more formally in Figure 4.

Theorem 9. The agent-algorithm of Figure 4 is agent-stabilizing.

Sketch of Proof. Let us first prove the independence property: a legitimate
configuration is defined by $1^p 0^{n-p}$. In such a configuration, if $p$ is greater than 1,
the system can choose between algorithm rules a and b and reaches either the
configuration $1^{p+1} 0^{n-p-1}$ (the token moves one step backward) or the
configuration $1^{p-1} 0^{n-p+1}$ (the token does not move). If $p$ is 1, then the system is only
allowed to reach the configuration $1 1 0^{n-2}$ (the token moves backward), and if $p$
is $n - 1$ the system is only allowed to reach the configuration $1^{n-2} 0 0$. In every
reachable configuration the number of tokens is one and every processor of the
system eventually has the token. So, starting from a legitimate configuration, if
there are no agent steps, the behavior is correct.

We then prove the non-interference property: if there is exactly one token
in the system, either it goes through $p$ and the agent never meets this
token, in which case Briefcase.context is NoTokens but the agent is silently destroyed, or
the agent meets the token, Briefcase.context is Tokens when the agent comes
back to $p$, and the agent is silently destroyed: the configuration remains legitimate.
Then, starting from a legitimate configuration, the agent has no effect on the
behavior of the system (the token is neither moved nor blocked). This also proves
the correctness property of the algorithm.

The finite time to live property is a consequence of the circulation
algorithm (an agent performs at most 2 rounds of the ring). For the convergence
property, we consider an initial configuration with one agent installed at
processor $p$. If there are no tokens in the system, when the agent comes back to
$p$, Briefcase.context is NoTokens, thus rule 3 applies and the following
configuration is legitimate; if there is more than one token in the system, either
the agent meets the supplementary tokens and then they are destroyed, or these
tokens are frozen (they cannot move to the next processor) by the first patched
algorithm rule. At most one token could pass through the Initiator during the
first round. If the agent has met a token and a token has passed through the
initiator, the agent will make another round, destroying one token, and no token
can pass through the initiator within this round. Thus, the agent has to meet
the tokens, and every supplementary token is destroyed. Thus, the configuration
reached after the agent has come back to the initiator (for the
second time if necessary) is legitimate. □



6 Conclusions 

In this paper, we have formally defined the notion of agent (in the context of 
stabilizing systems), and agent-stabilizing systems. We proposed a set of axioms 
for having a powerful tool for dealing with failure tolerance. We illustrated the 
use and the power of this tool on two static problems, which provide a general 
transformation scheme and two dynamic classical problems. 

The agent is a special message that circulates in the system, checks its con- 
sistency and mends it if faults are detected. It is created by a lower layer which 
was not described here, but whose properties are well defined (uniqueness and 
liveness of creation). We also defined a set of constraints (properties), that the 
system must satisfy to be considered as agent-stabilizing. In addition to the 
classical correctness and convergence properties, we added independence and 
non-interference notions. These notions ensure that the system will not rely on
the agent in the (normal) case without failure, and that the presence of the agent
in the normal case does not disturb the system.

We provided a general transformation of self-stabilizing algorithms designed
for non-anonymous networks into agent-stabilizing algorithms for anonymous
networks. Finally, we have shown that the agent can also be used to transform
non-stabilizing algorithms, like the token circulation on an anonymous ring with
two states per processor, into stabilizing ones.
Abstract. Routing messages in a network is often based on the assump-
tion that each link, and so each path, in the network is bidirectional. The
two directions of a path are employed in routing messages as follows. One 
direction is used by the nodes in the path to forward messages to their 
destination at the end of the path, and the other direction is used by 
the destination to inform the nodes in the path that this path does lead 
to the destination. Clearly, routing messages is more difficult in directed 
networks where links are unidirectional. (Examples of such networks are 
mobile ad-hoc networks and satellite networks.) In this paper, we present 
the first stabilizing protocol for routing messages in directed networks. 
We keep our presentation manageable by dividing it into three (relatively 
simple) steps. In the first step, we develop an arbitrary directed network 
where each node broadcasts to every reachable node in the network. In 
the second step, we enhance the network such that each node broadcasts 
its shortest distance to the destination. In the third step, we enhance the 
network further such that each node can determine its best neighbor for 
reaching the destination. 



1 Introduction 

The routing of messages from a source node to a destination node is a
fundamental problem in computing networks. In general, routing protocols are divided
into two broad categories [?]: link-state protocols and distance-vector protocols.
In link-state protocols, each node broadcasts a list of its neighbors to all nodes in
the network. Each node then builds in its memory the topology of the network,
and computes the shortest path to each destination. A total of $O(n^2)$ storage is
required to store this topology. Examples of link-state protocols include [?]. In
distance-vector protocols, each node forwards to all neighboring nodes a vector
with its distance to each destination. In this way, only $O(n)$ storage is required.
Examples of distance-vector protocols include [?].

There is an implicit assumption in these protocols. It is assumed that each 
link, and therefore each path, is bidirectional. The forward path from a source 
node to a destination node is discovered by sending routing messages along this 
path, and the backward path is used to inform the source of the existence of 



A.K. Datta and T. Herman (Eds.): WSS 2001, LNCS 2194, pp. 51-^^ 2001. 
(c) Springer- Verlag Berlin Heidelberg 2001 






Jorge A. Cobb and Mohamed G. Gouda 



the forward path. However, new technologies are producing directed networks,
that is, networks with unidirectional links. An example is networks with satellite
links, since these links allow only unidirectional traffic [?]. Another example
is mobile wireless networks, in particular, ad-hoc networks [?]. In these
networks, nodes communicate with each other via radio links. These links may be
unidirectional for several reasons. For example, there might be a disparity in the 
transmission power of two neighboring nodes. Thus, only one node may be able 
to receive messages from the other. In addition, if there is more interference at 
the vicinity of one node, then the node with the higher interference is unable to 
receive messages from its neighboring node. 

Routing protocols that assume bidirectional links will fail in a directed
network [?]. That is, the shortest path to a destination will not be found when
this path contains unidirectional links. To remedy this shortcoming, new routing
protocols have been developed that take unidirectional links into account [?].
Because routing protocols are essential in computing networks, it is desirable
for them to be stabilizing. A protocol is said to be stabilizing iff it
converges to a normal operating state starting from any arbitrary state [?].
Although stabilizing routing protocols exist for undirected networks, the
protocols in [?] for directed networks have not been shown to be stabilizing.
In one of these [?], “tunnels” need to be configured to go around single
unidirectional links. Furthermore, this technique is not applicable to networks
with an arbitrary number of unidirectional links. In another [?], the
stabilization of the protocol is not addressed (unbounded sequence numbers are
used, which may require an unbounded stabilization time).

In this paper, we present the first stabilizing protocol for routing messages 
in directed networks. We keep our presentation manageable by dividing it into 
three (relatively simple) steps. In the first step, we develop an arbitrary directed 
network where each node broadcasts to every reachable node in the network. 
In the second step, we enhance the network such that each node broadcasts its 
shortest distance to the destination. In the third step, we enhance the network 
further such that each node can determine its best neighbor for reaching the 
destination. In our network, each node is a process, and for simplicity, processes 
communicate with each other via shared memory. A message passing implementation
is briefly discussed in Section 8. In addition, we assume that the cost
of each link in the network is one. Thus, the shortest path (i.e., with the least 
number of links) is found to each destination. It is straightforward to modify our 
network to work with arbitrary positive costs assigned to each link. 



2 Directed Networks 

We consider a network of communicating processes that can be represented by a
directed graph. In this directed graph, each node represents a distinct process
in the network, and each directed edge from a process u to a process v
represents a possible flow of information from u to v. Specifically, a directed
edge from a process u to a process v indicates that each action of v can read
the variables of both u and v, but can write only the variables of v. Thus, the
existence of a directed path from a process u to a process v indicates that
information can flow from u to v (via the intermediate processes in the directed
path), and the lack of a directed path from a process u to a process v indicates
that information cannot flow from u to v.

If there is a directed edge from a process u to a process v, then u is called a
backward neighbor of v and v is called a forward neighbor of u. In every process
u, two constant sets, B.u and F.u, are declared as follows.

    const B.u : set of identifiers of all backward neighbors of u,
          F.u : set of identifiers of all forward neighbors of u

Without loss of generality, we assume that for each process u in a network, both
B.u and F.u are non-empty. Each process in the network has a unique identifier
in the range 0 .. n − 1, where n is the number of processes in the network.
Process 0 is called the network root.
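
As a rough illustration of these definitions, the following Python sketch
derives the sets B.u and F.u from a list of directed edges. The function name
`neighbor_sets` and the four-process example network are assumptions made for
this illustration only; neither appears in the paper.

```python
def neighbor_sets(n, edges):
    """Return (B, F): backward and forward neighbor sets for processes 0..n-1.

    An edge (u, v) means that v can read the variables of u, so u is a
    backward neighbor of v and v is a forward neighbor of u.
    """
    B = {u: set() for u in range(n)}
    F = {u: set() for u in range(n)}
    for u, v in edges:
        F[u].add(v)
        B[v].add(u)
    return B, F


# Hypothetical network: the cycle 0 -> 1 -> 2 -> 3 -> 0 plus the edge 2 -> 0.
EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (2, 0)]
B, F = neighbor_sets(4, EDGES)
```

In this example every B.u and F.u is non-empty, which matches the convention
assumed in the text.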



3 Routing Trees 

Information needs to flow from every process in a directed network to the net- 
work root. To achieve this goal, every process u maintains the identifier of one of 
its forward neighbors, the one closest to the network root, in a variable named 
next.u. When the network reaches a stable state, the values of the next.u vari- 
ables define a directed rooted tree where all the directed paths lead to the net- 
work root. This tree is called a routing tree. 

Also, every process u maintains the length of (i.e., the number of edges in) 
the shortest directed path from u to the root in a variable called dist.u. When 
the network reaches a stable state, the value of each dist.u variable defines the 
length of the directed path from u to the root in the routing tree. 

In fact, not every process in the network can be in the routing tree for two 
reasons. First, if the network has no directed path from a process u to the root, 
then information cannot flow from u to the root, and u cannot be in the routing 
tree. Second, if the network has no directed path from any backward neighbor of 
the root to a process u, then u cannot be informed of whether there is a directed 
path from u to the root. In this case, no information flow will be attempted from 
u to the root, and u cannot be in the routing tree. 

From this discussion, u is in the routing tree of a network iff there is a
backward neighbor v of the network root such that the network has a directed
path from u to v and a directed path from v to u. The first path (from u to v)
can be used as a route for the information flow from u to the root, and the
second path (from v to u) can be used to inform u about the existence of the
first path. When the network reaches a stable state, if a process u is in the
routing tree, then the value of variable dist.u is in the range 0 .. n − 1;
otherwise, the value of variable dist.u is n.
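
The membership condition above can be checked mechanically with two
reachability tests. The sketch below is only an illustration: the function
names and the five-process example network are hypothetical, zero-length paths
are counted (so a backward neighbor of the root qualifies through itself), and
the root is always counted as a member, which is an assumption consistent with
the set S0 = {0} used in the appendix.

```python
from collections import deque


def reachable_from(start, succ):
    """Set of processes reachable from `start` by a directed path."""
    seen = {start}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for c in succ.get(a, ()):
            if c not in seen:
                seen.add(c)
                queue.append(c)
    return seen


def routing_tree_members(n, edges, root=0):
    """Processes u with some backward neighbor v of the root such that
    there is a path from u to v and a path from v to u."""
    succ = {}
    for a, c in edges:
        succ.setdefault(a, set()).add(c)
    reach = {u: reachable_from(u, succ) for u in range(n)}
    back_of_root = {a for a, c in edges if c == root}
    members = {root}  # assumption: the root itself is in the tree
    for u in range(n):
        if any(v in reach[u] and u in reach[v] for v in back_of_root):
            members.add(u)
    return members


# Hypothetical network: 3 can reach the root but cannot be informed of it,
# and 4 can be informed but has no path to the root.
EDGES = [(1, 0), (2, 1), (1, 2), (3, 1), (0, 4)]
MEMBERS = routing_tree_members(5, EDGES)
```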

As an example, consider the directed network in Fig. 1. There is no directed
path from processes 3 and 7 to process 0 (the root); thus, processes 3 and 7
cannot be in the routing tree. Also, there is no directed path from a backward
neighbor of process 0 (i.e., from process 2) to processes 1 and 5; thus,
processes 1 and 5 cannot be in the routing tree. For each of the other processes
(i.e., processes 2, 4, and 6), there is a backward neighbor of process 0 such
that there is a directed path from the process to the backward neighbor and a
directed path from the backward neighbor to the process. Thus, each of the
processes 2, 4, and 6 is in the routing tree.




The routing tree for this network is shown in Fig. 2. In this figure, the
values of the two variables next.u and dist.u are written beside every process u.
Also, the network edges that belong to the routing tree are shown as solid lines, 
whereas the other edges are shown as dashed lines. 



[Figure: the network of Fig. 1 with next.u and dist.u written beside each
process u. Legible labels: next.2 = 0; next.4 = 2, dist.4 = 2; next.6 = 4,
dist.6 = 3; for processes 1, 3, and 5, next is irrelevant and dist = 8; for
process 7, next is irrelevant.]

Fig. 2. The routing tree of the network in Fig. 1

Next, we discuss how to make the processes in a directed network maintain a
routing tree. For simplicity, we divide our discussion into three steps. In the
first step, we discuss how to make each process u broadcast a local value x.u
to every other process (that can be reached via a directed path from u) in the
directed network. In the second step, we enhance the network processes such
that each value x.u, which is broadcast by process u, is the shortest distance
from u to the root. When the modified network reaches a stable state, each
process u knows the network distance vector that stores, for every process v in
the network, the shortest distance from v to the network root. In the third
step, each process u computes from its network distance vector the two values
next.u and dist.u needed for maintaining the routing tree. These three steps are
discussed in more detail in Sections 5, 6, and 7.



4 Network Notation 



Before presenting our networks of processes, we first give a short overview of the 
notation that we use in specifying our processes. For simplicity, our processes are 
specified using a shared memory notation. In particular, each process is specified 
by a set of constants, a set of variables, a set of parameters, and a set of actions. 
A process is specified as follows. 



    process <process name>
    const   <constant name> : <type>,
            <constant name> : <type>
    var     <variable name> : <type>,
            <variable name> : <type>
    par     <parameter name> : <type>,
            <parameter name> : <type>
    begin
            <action>
      []    ...
      []    <action>
    end

The constants declared in a process can be read, but not written, by the
actions of that process. The variables declared in a process can be read by the
actions of that process and the actions of forward neighbors of that process.
The variables declared in a process can be written only by the actions of that
process. Parameters are discussed below.

Every action in a process is of the form <guard> → <body>. The <guard> is a
boolean expression over the constants, variables, and parameters declared in
the process, and also over the variables declared in the backward neighbors of
that process. The <body> is a sequence of assignment statements that update
the variables of the process.

Each parameter declared in a process is used as a shorthand to write a set of 
actions as one action. For example, if we have the following parameter definition, 

    par g : 1 .. 3

then the following action

    x = g → x := x + g

is a shorthand notation for the following three actions.

    x = 1 → x := x + 1
    []
    x = 2 → x := x + 2
    []
    x = 3 → x := x + 3

An execution step of a network consists of evaluating the guards of all the 
actions of all processes, choosing one action whose guard evaluates to true, and 
executing the body of this action. An execution of a network consists of a se- 
quence of execution steps, which either never ends, or ends in a state where 
the guards of all the actions evaluate to false. We assume all executions of a 
network to be weakly fair, that is, an action whose guard is continuously true is 
eventually executed. 
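
To make this execution model concrete, here is a minimal Python sketch of an
executor for guarded actions. It is an illustration under stated assumptions,
not part of the paper: the names `run` and `make_action` are hypothetical, and
the nondeterministic choice of an enabled action is replaced by a fixed cyclic
scan, which is one simple schedule that satisfies weak fairness. The example
expands the parameterized action x = g → x := x + g for g in 1 .. 3.

```python
def make_action(g):
    """One instance of the parameterized action  x = g -> x := x + g."""
    def guard(state):
        return state["x"] == g

    def body(state):
        state["x"] = state["x"] + g

    return guard, body


def run(state, actions, max_steps=1000):
    """Execute enabled actions in cyclic order until every guard is false.

    Returns the number of execution steps. A continuously enabled action is
    reached within one full scan, so the schedule is weakly fair.
    """
    steps = 0
    idle = 0  # number of consecutive actions found disabled
    i = 0
    while idle < len(actions) and steps < max_steps:
        guard, body = actions[i]
        if guard(state):
            body(state)
            steps += 1
            idle = 0
        else:
            idle += 1
        i = (i + 1) % len(actions)
    return steps


state = {"x": 1}
steps = run(state, [make_action(g) for g in (1, 2, 3)])
```

Starting from x = 1, the actions for g = 1 and g = 2 fire in turn, after which
x = 4 and no guard is true, so the execution ends.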

5 Directed Broadcast 

In this section, we discuss a directed network where each process u computes 
an array X.u that has n elements, where n is the number of processes in the 
network. When the network reaches a stable state, the element X[v].u in 
array X.u has the value x.v that is local to process v. 

In this network, each process u maintains, along with each element X[v].u,
two corresponding elements:

    b[v].u = identifier of a backward neighbor of u from which u
             has read the latest value of X[v].u

    d[v].u = length of (i.e., number of edges in) the directed path along
             which the value x.v (in v) is transmitted to X[v].u (in u)

Note that if the value of d[v].u ever becomes n, then process u recognizes
that it has not yet found a directed path from v to u and that the current value
of X[v].u is probably incorrect. Thus, if the network has no directed path from
process v to process u, then the value of d[v].u stabilizes to n and the value of
X[v].u stabilizes to a probably incorrect value.

Each process u in the directed broadcast network is defined next. In this
definition, we use the expression a ⊕ b to mean min(a + b, n).

    process u : 0 .. n − 1
    const
        B.u : set of identifiers of all backward neighbors of u,
        F.u : set of identifiers of all forward neighbors of u,
        x.u : 0 .. n                       {local constant in u}
    var
        X.u : array [0 .. n − 1] of 0 .. n,
        b.u : array [0 .. n − 1] of B.u,
        d.u : array [0 .. n − 1] of 0 .. n
    par
        v : 0 .. n − 1,                    {any process in the network}
        w : B.u                            {any backward neighbor of u}
    begin
        X[u].u ≠ x.u ∨ d[u].u ≠ 0 →
            X[u].u := x.u;
            d[u].u := 0
    []
        v ≠ u ∧ b[v].u = w ∧ (X[v].u ≠ X[v].w ∨ d[v].u ≠ d[v].w ⊕ 1) →
            X[v].u := X[v].w;
            d[v].u := d[v].w ⊕ 1
    []
        v ≠ u ∧ d[v].w ⊕ 1 < d[v].u →
            X[v].u := X[v].w;
            b[v].u := w;
            d[v].u := d[v].w ⊕ 1
    end



Process u has three actions. In the first action, u ensures that X[u].u equals
its local constant x.u and d[u].u is zero. In the second action, u recognizes
that it has read the latest value of X[v].u from a backward neighbor w and so
ensures that X[v].u equals X[v].w and d[v].u equals d[v].w ⊕ 1. In the third
action, u recognizes there is a shorter directed path from v to u along its
backward neighbor w. In this case, u assigns to X[v].u the value of X[v].w,
assigns to b[v].u the value of w, and assigns to d[v].u the value d[v].w ⊕ 1.
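
The three actions can be exercised with a small simulation. The Python sketch
below is an illustration under several assumptions that are not in the paper:
the function names, the random (arbitrary) initial state, the round-robin
schedule, and the example network are all hypothetical. The paper's element
X[v].u is stored as X[u][v], and likewise for b and d. The example network is
strongly connected, so every d entry should end up below n.

```python
import random
from collections import deque


def oplus(a, b, n):
    """a (+) b = min(a + b, n), as defined in the text."""
    return min(a + b, n)


def stabilize(n, edges, x, seed=0, max_passes=10_000):
    """Run the three actions round-robin from an arbitrary state until no
    guard is true; returns (X, b, d)."""
    B = {u: sorted(a for a, c in edges if c == u) for u in range(n)}
    rng = random.Random(seed)
    # Arbitrary, possibly illegal, initial state within the declared types.
    X = [[rng.randint(0, n) for _ in range(n)] for _ in range(n)]
    d = [[rng.randint(0, n) for _ in range(n)] for _ in range(n)]
    b = [[rng.choice(B[u]) for _ in range(n)] for u in range(n)]
    for _ in range(max_passes):
        fired = False
        for u in range(n):
            if X[u][u] != x[u] or d[u][u] != 0:        # first action
                X[u][u], d[u][u] = x[u], 0
                fired = True
            for v in range(n):
                if v == u:
                    continue
                w = b[u][v]                             # second action
                if X[u][v] != X[w][v] or d[u][v] != oplus(d[w][v], 1, n):
                    X[u][v], d[u][v] = X[w][v], oplus(d[w][v], 1, n)
                    fired = True
                for w in B[u]:                          # third action
                    if oplus(d[w][v], 1, n) < d[u][v]:
                        X[u][v], b[u][v] = X[w][v], w
                        d[u][v] = oplus(d[w][v], 1, n)
                        fired = True
        if not fired:
            return X, b, d
    raise RuntimeError("no fixed point reached within max_passes")


def hops(n, edges, src):
    """Reference shortest hop counts from src (n means unreachable)."""
    dist = [n] * n
    dist[src] = 0
    queue = deque([src])
    while queue:
        a = queue.popleft()
        for p, q in edges:
            if p == a and dist[q] == n:
                dist[q] = dist[a] + 1
                queue.append(q)
    return dist


EDGES = [(0, 1), (1, 2), (2, 3), (3, 0), (2, 0)]  # strongly connected
x = [4, 3, 2, 1]                                   # local constants in 0..n
X, b, d = stabilize(4, EDGES, x)
```

When the loop stops, d[u][v] can be compared with the reference hop count from
v to u computed by `hops`, and X[u][v] with x[v].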

This network maintains, for every process u, a stabilizing, rooted,
shortest-path spanning tree T.u. The root of T.u is process u itself, and T.u
contains every process that is reachable from u via a directed path. The value
of the constant x.u flows from u over T.u to every process in T.u.



6 Distance Vectors

In this section, we modify the network in the previous section such that the
value of each element X[v].u in each process u, when the network reaches a
stable state, is the shortest distance (i.e., the smallest number of edges in a
directed path) from process v to the network root. The modification is slight.
The second and third actions in every process remain the same as before. Only
the first action of each process u is modified to become as follows.

    X[u].u ≠ f(u, F.u, X.u) ∨ d[u].u ≠ 0 →
        X[u].u := f(u, F.u, X.u);
        d[u].u := 0

Above, f(u, F.u, X.u) computes the shortest distance from u to the network root,
and is defined as follows.

    f(u, F.u, X.u) = 0    if u = 0,
                     1    if u ≠ 0 ∧ 0 ∈ F.u,
                     (min over v, v ∈ F.u ∧ d[v].u < n, of X[v].u) ⊕ 1    otherwise

In the appendix, we present a proof of the stabilization of this network.
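
The definition of f can be transcribed almost directly. The following Python
sketch is illustrative only. Two points are assumptions rather than statements
of the paper: the array d.u is passed as an explicit argument (the definition
reads d[v].u although the paper's signature lists only u, F.u, and X.u), and
the minimum over an empty set of candidates, which the text leaves unspecified,
is taken to yield n.

```python
def f(u, F_u, X_u, d_u, n):
    """Estimate of the shortest distance from process u to the root (process 0).

    X_u[v] and d_u[v] stand for u's local entries X[v].u and d[v].u.
    """
    if u == 0:
        return 0
    if 0 in F_u:
        return 1
    # Only forward neighbors v whose value has actually reached u (d[v].u < n).
    candidates = [X_u[v] for v in F_u if d_u[v] < n]
    if not candidates:
        return n  # assumption: an empty minimum means "no known route"
    return min(min(candidates) + 1, n)  # saturating addition: (+) 1
```

For instance, with n = 4, F.u = {1, 3}, X.u = [0, 1, 4, 2], and
d.u = [1, 2, 0, 4], only forward neighbor 1 qualifies, so process 2 obtains
1 ⊕ 1 = 2.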

7 Maintaining a Routing Tree 

To make the network in the previous section maintain a routing tree (as defined
in Section 3), each process u, u ≠ 0, is modified as follows. First, the
following two variables next.u and dist.u are added to process u.

    var next.u : F.u,
        dist.u : 0 .. n

Second, the following action is added to process u.

    next.u ≠ h(F.u, X.u) ∨ dist.u ≠ X[u].u →
        next.u := h(F.u, X.u);
        dist.u := X[u].u

where

    h(F.u, X.u) = the smallest identifier w in F.u such that
                  X[u].u = X[w].u ⊕ 1
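
A possible transcription of h is sketched below, for illustration only. The
extra arguments u and n, and the return value None when no forward neighbor
satisfies the condition, are assumptions; the paper does not say what h yields
in that case.

```python
def h(F_u, X_u, u, n):
    """Smallest identifier w in F.u with X[u].u = X[w].u (+) 1.

    X_u[v] stands for u's local entry X[v].u.
    """
    for w in sorted(F_u):
        if min(X_u[w] + 1, n) == X_u[u]:
            return w
    return None  # assumption: no forward neighbor matches
```

With n = 4, F.u = {1, 3}, and X.u = [0, 1, 2, 1], process 2 has X[2].2 = 2 and
both forward neighbors report distance 1, so the smaller identifier 1 is
selected.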

8 Message Passing Implementation 

In order to simplify our presentation, processes in our network communicate
with their neighbors using shared memory. In this section, we discuss a message
passing implementation of our network.

We first address the construction of the shortest spanning trees discussed in
Section 5. In our shared memory model, each process reads array d from each of
its backward neighbors. To implement this, each process places the contents of
its d array into a message, and periodically sends this message to all its
forward neighbors. We will refer to this message as the spanning-tree message.
Since array d has n entries, the size of the spanning-tree message is O(n).

We next address the broadcast of the distance to the root (i.e. X[u].u) by 
every process u. To implement this, each process u places its distance to the 
root and its process identifier into a message, which we refer to as the broadcast 
message. This message is periodically sent to all forward neighbors of u. When a 
process v receives a broadcast message whose process identifier is u, this message 
is forwarded to all the forward neighbors of v, provided the message was received 
from neighbor b[u].v. In this way, the broadcast message is forwarded only along 
the spanning tree T.u, and the propagation of this message is cycle free. Note 
that the size of the broadcast message is O(1).
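
The forwarding rule for broadcast messages can be sketched as a single handler.
This is an illustration under assumptions, not the paper's implementation: the
function name, the message layout (originator, distance), and the decision to
drop a process's own broadcast if it comes back are all hypothetical.

```python
def on_broadcast(v, origin, dist_to_root, received_from, b_v, F_v):
    """Handle a broadcast message arriving at process v.

    b_v[origin] stands for b[origin].v, the backward neighbor of v on the
    spanning tree rooted at `origin`. Returns the list of
    (destination, message) pairs that v sends in response.
    """
    if origin == v:
        return []  # assumption: v does not re-forward its own broadcast
    if received_from != b_v[origin]:
        return []  # not received along the spanning tree: discard
    return [(w, (origin, dist_to_root)) for w in sorted(F_v)]
```

Because a message is forwarded only when it arrives from the tree parent
b[origin].v, each process forwards a given broadcast along a single tree.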

The above two messages, namely the spanning-tree and broadcast message, 
are all that is required to implement the process network in a message passing 
model. We next address the storage requirements of each process. 

Each process is required to store its d array, whose size is O(n). With respect
to the distances to the root (i.e., array X), note that when a process computes
its distance to the root, it only requires the distance to the root of each of
its forward neighbors. Therefore, in a message passing implementation, the
distance to the root broadcast by any other process is simply forwarded as soon
as it is received. Thus, only the distances of the forward neighbors need to be
stored.

Furthermore, with a more careful implementation, only the distance to the 
root of the next hop neighbor (next.u) needs to be stored. If process u receives 
a broadcast from a forward neighbor indicating a smaller distance to the root 
than that of neighbor next.u, then next.u is updated to this neighbor, and the 
new distance is recorded. Thus, we require O(1) storage for the distance to the
root. 

In our network, we considered only a single process (the root process) as
the destination. To allow any process to be a destination, each process needs to
maintain the distance to each destination. Thus, instead of O(1) storage for the
distance to the root mentioned above, we require O(n) storage, i.e., O(1)
storage for each of the n possible destinations. Note, however, that array d
remains as before, since our original network allows every process to perform a
broadcast. Since array d requires O(n) storage, the storage remains O(n). In
addition, the broadcast message would include the distance to each destination,
and thus would be of size O(n). The spanning-tree message remains O(n).
Therefore, since vectors of size n are sent to each neighbor, this network falls
into the category of distance-vector routing networks.

One final issue remains to be addressed, and that is the detection of channel
failure. Thus far, we have assumed that if a process v is in the forward neighbor
set F.u of a process u, then the channel from u to v is in working order. This is
possible in networks where the channel is implemented by a lower layer, and the
status of this channel is monitored by the lower layer (e.g., the channel could
be an ATM circuit, and the status of the channel is maintained by the ATM
layer). If the lower layer at u detects that the channel from u to v has failed,
then v is removed from F.u. However, if the lower layer does not provide the
capability of monitoring the status of the channel, process u can monitor the
status as follows. When process v sends a broadcast message, in addition to
including its distance to each destination, it also includes a list of its
backward neighbors from whom it has recently received messages. When process u
receives a broadcast from v, it checks if u is in the list of backward neighbors
of v. If so, then the channel from u to v is in working order.

Note that if the broadcast message includes the list of backward neighbors,
we may choose to only include this list in the message, and not include the
distance to each destination. In this case, each process would have to collect
the list of neighbors from each process in the network, build in its memory a
graph representing the network topology, and choose its next hop neighbor using
Dijkstra’s shortest path algorithm [?]. In this case, the storage required would
increase to O(n²), and the network would fall into the category of link-state
routing networks.
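
For the link-state variant, the next hop can be read off a standard run of
Dijkstra's algorithm over the collected topology. The sketch below is
illustrative; the function name, the adjacency-dictionary layout, and the
example topology are assumptions. Each heap entry carries the first hop of the
path it represents, so the answer is available as soon as the destination is
settled.

```python
import heapq


def next_hop(adj, source, dest):
    """First hop on a least-cost directed path from source to dest.

    adj maps each node to a dict {forward neighbor: positive link cost}.
    Returns None if dest is unreachable or equals source.
    """
    best = {source: 0}
    done = set()
    heap = [(0, source, None)]  # (cost so far, node, first hop used)
    while heap:
        cost, node, hop = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == dest:
            return hop
        for nb, c in adj.get(node, {}).items():
            new_cost = cost + c
            if nb not in done and new_cost < best.get(nb, float("inf")):
                best[nb] = new_cost
                first = nb if hop is None else hop
                heapq.heappush(heap, (new_cost, nb, first))
    return None


# Hypothetical topology with unit link costs.
ADJ = {1: {2: 1, 3: 1}, 2: {0: 1}, 3: {4: 1}, 4: {0: 1}}
```

Here process 1 would choose forward neighbor 2, since the path 1, 2, 0 is
shorter than 1, 3, 4, 0.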

Finally, note that if a list of neighbors is required in the broadcast message,
either because we have a link-state network or because we need to detect the
status of the channel in a distance-vector network, then the requirements for a
process u to be in the routing tree are different from those presented in
Section 3. In this case, a path must exist from u to the root and a path must
also exist from the root to u.



9 Concluding Remarks 

In this paper, we presented a network of processes that constructs a routing 
tree to a given destination, even though the network is directed, i.e., communi- 
cation between neighboring processes may be unidirectional. We presented our 
network in three steps. First, we presented a network that allows each process 
to broadcast a value to all other processes. Next, we presented a network where 
each process can compute its distance to the destination, and broadcast this 
distance to all processes. Finally, we presented a network where a routing tree 
is constructed by having each process choose its parent in the routing tree in
accordance with its distance to the destination.

Since an undirected network is a special case of a directed network, the
network of processes we presented in Section 7 will correctly build a routing
tree in an undirected network. However, it will not do so in the most efficient
way, since processes are tailored towards a directed network. In future work, we
will investigate networks of processes whose behavior will vary depending on the
number of processes that have unidirectional communication with their neighbors.
That is, processes will adapt to the “level” of unidirectional communication in
the network, and adjust their behavior accordingly to improve performance.

Routing in directed networks is a fertile area of research, and much is yet
to be done. Existing approaches assume bidirectional communication between
neighbors, and thus will fail or exhibit different behaviors in directed
networks. Other distributed algorithms besides routing may also be affected by
directed networks. Two reasons for this may be routing asymmetry, i.e., for two
nodes u and v, the path from u to v is not necessarily the same as the path from
v to u, and one-way reachability, that is, there is a path from u to v but there
is no path from v to u.



Appendix: Proof of Stabilization

Let Pmin(v,u) be a shortest path from v to u. Let G(v) be the graph obtained
from all edges of the form (b[v].u, u) for every process u, u ≠ v. Let PG(v,u)
be the path from v to u in G(v)¹. If no such path exists, then PG(v,u) is the
empty path. Let btree denote the following predicate.

    (∀u :: d[u].u = 0) ∧
    (∀v,u : v ≠ u ∧ Pmin(v,u) = ∅ : d[v].u = n) ∧
    (∀v,u : v ≠ u ∧ Pmin(v,u) ≠ ∅ :
        PG(v,u) ≠ ∅ ∧ |PG(v,u)| = |Pmin(v,u)| = d[v].u)



Lemma 1. 

1. btree is stable 

2. true converges to btree 

Proof. We focus only on arrays d.u and b.u of each process u, since only these
variables are involved in btree. We first show the stability of btree.

Consider the first action. This action affects only d[u].u. From btree,
d[u].u = 0 before the action. Thus, executing the action does not change d[u].u.

Consider the second action. This action affects only d[v].u. We have two cases.

1. Consider first d[v].w < n. In this case, from btree, PG(v,w) ≠ ∅, and
   d[v].w = |PG(v,w)| = |Pmin(v,w)|. Since PG(v,w) ≠ ∅ and b[v].u = w,
   then PG(v,u) = PG(v,w); (w,u), and thus Pmin(v,u) ≠ ∅. This, along with
   btree, implies that d[v].u = |Pmin(v,u)| = |PG(v,u)| = |PG(v,w)| + 1 =
   d[v].w + 1. Also, since |Pmin(v,u)| < n, then d[v].w + 1 < n, and hence,
   d[v].w + 1 = d[v].w ⊕ 1. Thus, d[v].u is not changed when it is assigned
   d[v].w ⊕ 1.

2. Consider instead d[v].w = n. In this case, from btree, Pmin(v,w) = ∅. Hence,
   PG(v,w) = ∅. From b[v].u = w, we have PG(v,u) = ∅. From btree, if
   Pmin(v,u) ≠ ∅, then PG(v,u) ≠ ∅. Thus, we must have Pmin(v,u) = ∅.
   Again, from btree, d[v].u = n, and thus d[v].u does not change when it is
   assigned d[v].w ⊕ 1.

Consider now the third action. Again, we have two cases.

1. Consider first d[v].u < n. From btree, d[v].u = |Pmin(v,u)| = |PG(v,u)|.
   From the action’s guard, d[v].w ⊕ 1 < d[v].u, which implies d[v].w < n,
   and from btree, d[v].w = |PG(v,w)| = |Pmin(v,w)|. Note, however, that
   since w is a backward neighbor of u, d[v].w ⊕ 1 < d[v].u implies that
   |PG(v,w); (w,u)| < |PG(v,u)|, which is impossible, since PG(v,u) is the
   shortest path from v to u. Thus, the guard must be false if btree holds.

2. Consider now d[v].u = n. From btree, there is no path from v to u. However,
   if d[v].w ⊕ 1 < d[v].u, then d[v].w < n, and btree implies there is a path
   from v to w, and thus there is also a path from v to u. Thus, again, the
   guard must be false if btree holds.

¹ Note that at most one such path may exist, since u has only one incoming edge
in G(v), and v has no incoming edge.

Since no action falsifies btree, btree is stable. We now show that true
converges to btree. First, note that the first action of u assigns zero to
d[u].u. No other action modifies d[u].u; thus, for all u, true converges to
d[u].u = 0, and d[u].u = 0 is stable.

Next, define btree(v,i), where 1 ≤ i < n, as follows:

    (∀u : u ≠ v ∧ (Pmin(v,u) = ∅ ∨ |Pmin(v,u)| > i) : d[v].u > i) ∧
    (∀u : u ≠ v ∧ Pmin(v,u) ≠ ∅ ∧ |Pmin(v,u)| ≤ i :
        d[v].u = |Pmin(v,u)| = |PG(v,u)|)

We show that eventually, for all i, btree(v,i) holds and continues to hold. We
show this by induction.

As a base case, consider i = 1. Consider any process u, u ≠ v. The second
and third actions assign at least 1 to d[v].u. Thus, we can assume we reach a
state where d[v].u ≥ 1 holds and continues to hold for all processes u, u ≠ v.

Consider a process u which is not a forward neighbor of v. Thus, for any
backward neighbor w of u, d[v].w ≥ 1, and thus d[v].u ≥ 2 after the second
or third action. Thus, d[v].u ≥ 2 will hold and continue to hold, as desired in
btree(v,1).

Consider now a process u which is a forward neighbor of v. We have two
cases.

1. Assume u executes its second or third action with parameter w = v.
   Then, since d[v].v = 0, we obtain b[v].u = v ∧ d[v].u = 1; thus, d[v].u =
   |Pmin(v,u)| = |PG(v,u)| as desired by btree(v,1). Note that this continues
   to hold for the following reason. First, the second action of u does not
   change b[v].u, and it assigns d[v].v ⊕ 1 = 1 to d[v].u. The third action of u
   is not enabled, since for any backward neighbor w of u (including w = v),
   d[v].w ⊕ 1 ≥ 1 = d[v].u.

2. Assume u executes its second or third action with parameter w ≠ v. Since
   d[v].w ≥ 1, it results in b[v].u = w ∧ d[v].u > 1. Since d[v].u > 1, the
   third action of u is enabled with w = v, and must eventually be executed,
   after which b[v].u = v ∧ d[v].u = 1, as shown above. Also, this continues to
   hold as shown above.

For the induction hypothesis, we assume 1 < i < n, and btree(v, i − 1) holds.
Consider first any process u, where u ≠ v ∧ Pmin(v,u) = ∅. For any backward
neighbor w of u, Pmin(v,w) = ∅, and from btree(v, i − 1), d[v].w > i − 1.
Hence, when process u executes its second or third action, d[v].u > i, as
desired. Since d[v].w > i − 1 will continue to hold, then so will d[v].u > i.

Consider next any process u, u ≠ v, where Pmin(v,u) ≠ ∅ ∧ |Pmin(v,u)| >
i > i − 1. In this case, any backward neighbor w of u must have Pmin(v,w) =
∅ ∨ |Pmin(v,w)| ≥ i > i − 1, and from btree(v, i − 1), d[v].w ≥ i, and this
continues to hold. When u executes its second or third action, d[v].u > i will
result. Thus, d[v].u > i will hold and continue to hold.

Consider now a process u with Pmin(v,u) ≠ ∅ ∧ |Pmin(v,u)| = i. This implies
that for all backward neighbors w of u, Pmin(v,w) = ∅ ∨ |Pmin(v,w)| ≥ i − 1.
From btree(v, i − 1), d[v].w ≥ i − 1. In addition, u must have a backward
neighbor y such that Pmin(v,y) ≠ ∅ ∧ |Pmin(v,y)| = i − 1. From btree(v, i − 1),
i − 1 = d[v].y = |PG(v,y)| = |Pmin(v,y)| holds and will continue to hold. We
have two cases.

1. Assume the second or third action of u is executed where w = y. Then,
   after the action is executed, b[v].u = y ∧ d[v].u = i, and hence, PG(v,u) =
   PG(v,y); (y,u) ∧ |PG(v,u)| = i = |Pmin(v,u)| as desired for btree(v,i).
   Furthermore, executing the second action does not change these values,
   and note that the third action cannot execute, since all backward neighbors
   w of u must have d[v].w ≥ i − 1, and thus, d[v].w ⊕ 1 ≥ d[v].u. Hence
   i = d[v].u = |Pmin(v,u)| = |PG(v,u)| will continue to hold forever.

2. Assume the second or third action of u is executed for some backward
   neighbor w, where w ≠ y. In this case, Pmin(v,w) = ∅ ∨ |Pmin(v,w)| ≥ i. From
   btree(v, i − 1), d[v].w ≥ i. Then, after the action, d[v].u ≥ i + 1. Note
   that in this case, the third action is enabled with w = y, and it will
   eventually be executed. As argued above, i = d[v].u = |Pmin(v,u)| = |PG(v,u)|
   will then hold and continue to hold.

This finishes the induction step, and thus the induction proof.

Note that btree = (∀v,i : 1 ≤ i < n : btree(v,i)) ∧ (∀u :: d[u].u = 0). Thus,
from the above, eventually we reach a state where btree holds and continues to
hold.



Corollary 1. For all u, v, and y, where u ≠ v,

    btree ∧ b[v].u = y is stable

Proof. Only the third action affects b[v].u, so we focus on this action.

Assume first that Pmin(v,u) = ∅. Thus, Pmin(v,w) = ∅ for any backward neighbor
w of u. From btree, d[v].u = n ∧ d[v].w = n. Thus, the third action is not
enabled.

Assume next that Pmin(v,u) ≠ ∅. From btree, d[v].u = |Pmin(v,u)| < n. For
any backward neighbor w of u, if the guard d[v].w ⊕ 1 < d[v].u is true, then this
implies d[v].w < n − 1, and from btree, d[v].w = |Pmin(v,w)|. Combining the
above, |Pmin(v,w)| ⊕ 1 < |Pmin(v,u)|. This is not possible since Pmin(v,u) is a
minimum path. Thus, the third action is not enabled.



Lemma 2. 

1. Consider a computation in which btree ∧ X[v].v ≤ k holds for some k and
   continues to hold. Let u ≠ v and PG(v,u) ≠ ∅. Then, for any y, where y ≠ v
   and y is a process in PG(v,u), X[v].y ≤ k will hold and continue to hold.

2. Consider a computation in which btree ∧ X[v].v ≥ k holds for some k and
   continues to hold. Let u ≠ v and PG(v,u) ≠ ∅. Then, for any y, where y ≠ v
   and y is a process in PG(v,u), X[v].y ≥ k will hold and continue to hold.

Proof. We consider only part 2 above. The proof for part 1 is similar.

From Lemma 1 and Corollary 1, btree continues to hold, and PG(v,u) does
not change. Also, only the second action modifies X[v].u, so we focus on this
action.

The proof is by induction on the length of PG(v,u). Assume that PG(v,u) =
(v,u), i.e., b[v].u = v. If X[v].u ≠ X[v].v, then the second action of u is
enabled with w = v. When this action executes, X[v].u = X[v].v ≥ k. If this
action becomes disabled before being executed, then from its guard it must also
be that X[v].u = X[v].v ≥ k. Thus, X[v].u ≥ k will hold and continue to hold.

For the induction step, let w = b[v].u, and let |PG(v,u)| = i, i > 1. We assume
the lemma holds for all paths in G(v) of length i − 1, in particular, PG(v,w).
Thus, eventually, X[v].w ≥ k holds, and it continues to hold. A similar argument
as the one above, except that process u assigns X[v].w to X[v].u, shows that
X[v].u ≥ k will hold and continue to hold. This concludes the proof.

Let Si, where 0 ≤ i < n, be a set of processes defined as follows.

    S0 = { 0 }
    S1 = { u | u is a backward neighbor of process 0 } ∪ S0
    Si = { u | there is a path from a backward neighbor v
               of the root to u, and a path of length at most i
               from u to the root via v } ∪ S1,   1 < i < n

Note that a process u is in the routing tree iff u ∈ Sn−1.

Lemma 3. For every i, 1 < i < n,

    u ∈ Si ≡ (u ∈ S1 ∨ (∃v : v ∈ F.u : v ∈ Si−1 ∧ Pmin(v,u) ≠ ∅))

Proof. Consider first the following implication.

    u ∈ Si ⇐ (u ∈ S1 ∨ (∃v : v ∈ F.u : v ∈ Si−1 ∧ Pmin(v,u) ≠ ∅))

If u is in S1 then, from the definition of Si, u is in Si for any i, i ≥ 1.
Instead, assume u is not in S1. Thus, assume there exists a v satisfying the
quantification. We are given that v ∈ Si−1. Note that i − 1 > 0, since
otherwise, v would be the root, and u would be in S1.

Assume first that i − 1 = 1. In this case, v is a backward neighbor of the root,
and from Pmin(v,u) ≠ ∅, there is a path from v to u. Thus, u is in Si. Assume
instead that i − 1 > 1. Then there is a path of length i − 1 from v to the root
via a backward neighbor w of the root, and a path from w to v. Since v ∈ F.u
and Pmin(v,u) ≠ ∅, there is a path of length i from u to the root via w (and via
v), and also a path from w to u (via v). Thus, u is in Si.

Consider now the other implication.

    u ∈ Si ⇒ (u ∈ S1 ∨ (∃v : v ∈ F.u : v ∈ Si−1 ∧ Pmin(v,u) ≠ ∅))

If u is in S1, then we are done. Assume instead that u is in Si, but u is not in
S1. Then, since u belongs to Si, there is a path from u to the root via a
backward neighbor w of the root, and the length of this path is i. Let v be the
next hop from u to w along this path. Thus, there is a path of length i − 1 from
v to the root via w. Also, since there is a path from w to u, then there is a
path from v to u. Hence, Pmin(v,u) ≠ ∅ and v is in Si−1.

Combining both implications we obtain the desired result.

Theorem 1. For every i, 0 ≤ i < n, the distance vector network stabilizes to
the following predicate.

(∀u :: u ∈ S_i ≡ X[u].u ≤ i)

Proof. X[u].u is changed only in the first action, so we focus only on this action.
Also, from the earlier lemmas, the network stabilizes to btree. Thus, consider a
computation starting from a state in which btree holds.

Consider first i = 0. The only process in S_0 is the root process 0. The
definition of function f ensures that X[0].0 is assigned 0 regardless of the network
state. Thus, X[0].0 = 0 is stable. For any process u, u ≠ 0, f assigns at least 1
to X[u].u. Thus, u ∈ S_0 ≡ X[u].u ≤ 0 will hold and continue to hold.

Consider next i = 1. Let u ∈ S_1, u ≠ 0. By the definition of S_1, the root is a forward
neighbor of u. From the definition of f, X[u].u will be assigned 1 regardless of the
network state. Thus, X[u].u = 1 will hold and continue to hold. Consider now
any process u, where u ∉ S_1. Thus, for every forward neighbor v of u, v ≠ 0, i.e.,
v ∉ S_0, and from above, X[v].v ≥ 1 will hold and continue to hold. If there is a
path from v to u then, from Lemma 2 part 2, eventually X[v].u ≥ 1 holds and
continues to hold. If there is no path from v to u, then from btree, d[v].u = n
holds and continues to hold. Thus, from the definition of f, X[u].u will always
be assigned at least 2, i.e., X[u].u > 1 will hold and continue to hold.

The remainder of the proof is by induction over i, where 1 ≤ i < n. We show
that for each i, the network stabilizes to (∀u :: u ∈ S_i ≡ X[u].u ≤ i). The base
case, i = 1, was shown above. Thus, consider 1 < i < n, and assume we have a
computation where we have reached a state where (∀u :: u ∈ S_{i-1} ≡ X[u].u ≤
i − 1) holds and continues to hold.

Consider a process u, u ∈ S_i, and u ∉ S_{i-1}. From the definition of S_i, and
from Lemma 3, a forward neighbor v belongs to S_{i-1} and there is a path from v
to u. From the induction hypothesis, X[v].v ≤ i − 1, and from btree, d[v].u < n.
Furthermore, from Lemma 2 part 1, eventually X[v].u ≤ i − 1 holds and
continues to hold. Thus, from the definition of f, X[u].u is assigned a value at most i,
and this continues to hold.

Consider now a process u, u ∉ S_i. From Lemma 3, u ∉ S_1, and for all
forward neighbors v of u, v ∉ S_{i-1} ∨ Pmin(v, u) = ∅. If v ∉ S_{i-1}, then from the
induction hypothesis, X[v].v > i − 1, i.e., X[v].v ≥ i, holds and continues to
hold, and from Lemma 2 part 2, eventually X[v].u ≥ i holds and continues to
hold. If Pmin(v, u) = ∅, then from btree, d[v].u = n. Hence, from f, eventually
X[u].u > i holds and continues to hold.

Thus, by induction, for all i, 0 ≤ i < n, the network stabilizes to (∀u :: u ∈
S_i ≡ X[u].u ≤ i).

Abstract. The first self-stabilizing algorithm published by Dijkstra in
1973 assumed the existence of a central daemon that activates one
processor at a time to change state as a function of its own state and the
state of a neighbor. Subsequent research has reconsidered this algorithm 
without the assumption of a central daemon, and under different forms 
of communication, such as the model of link registers. In all of these 
investigations, one common feature is the atomicity of communication, 
whether by shared variables or read/write registers. This paper weak- 
ens the atomicity assumptions for the communication model, proposing 
versions of Dijkstra’s algorithm that tolerate various weaker forms of 
atomicity, including cases of regular and safe registers. The paper also 
presents an implementation of Dijkstra’s algorithm based on registers 
that have probabilistically correct behavior, which requires a notion of 
weak stabilization, where Markov chains are used to evaluate the prob- 
ability to be in a safe configuration. 



1 Introduction 

The self-stabilization concept is not tied to particular system settings. Our work
considers several new system settings and demonstrates the applicability of the
self-stabilization paradigm to these systems. In particular, we investigate
systems with regular and safe registers and present modifications of Dijkstra’s first
self-stabilizing algorithm that stabilize in these systems.

Dijkstra’s presentation of self-stabilization relies on communication
by reading neighbor states and updating one machine’s state in one atomic
operation. The well-known first algorithm of Dijkstra represents a machine state by
a counter value. Subsequently, this fundamental algorithm has been adapted to
link register and message-passing models. These adaptations are
straightforward, essentially changing only the domain of the counter values and taking
care to compare communication variables to the local state variables in the
correct manner. The processing of communication variables, be they message buffers
or communication registers, is atomic in these adaptations of Dijkstra’s algorithm. On the other
hand, only a few papers address non-atomic communication
operations in the context of self-stabilization. Lamport initially demonstrated that
interprocess communication without explicit synchronization is possible, and
formalizations of less-than-atomic communication were subsequently developed.
The resulting register hierarchy and register constructions inspired an
active research area. The register hierarchy (safe, regular, and atomic registers)
has many motivations, including implementation cost for shared register
operations. Another view of weaker forms of registers (safe or regular, when compared
to atomic) is that they are possible “failure modes” of atomic registers. Thus, if
we can adapt algorithms such as Dijkstra’s to accommodate weaker communication
assumptions, the result will be an algorithm that not only recovers from
transient faults but also deals with certain types of functional errors; hence the
title of our paper, the so-called “unsupportive environments.” The idea of
combining self-stabilization with other forms of fault-tolerance has previously been
studied.

We summarize our modifications to Dijkstra’s algorithm as follows. The solution for the
regular registers case uses a special label in between writes of counter values. In the
case of safe registers we prove impossibility results for the cases in which
neighboring processors use a single safe register to communicate between themselves,
whether or not the register is divided into multiple fields. On the positive side, we
define a composite safe register that, roughly speaking, ensures reads return at
most one corrupted field, and we design an algorithm for that case. Subsequently, we
consider a stronger model where processors can read the value written in their
output registers (therefore avoiding extra writes for refreshes). We present two
algorithms for the above case, one that uses unary encoding and another that is
based on Gray code.

Then we introduce randomized registers that, roughly speaking, return the 
“correct value” with probability p. It is impossible to ensure closure in such a 
system, since all reads may return incorrect values. We introduce the notion of 
weak self-stabilization for such systems. We use Markov chains to compute the 
ratio between the number of safe configurations and unsafe configurations in an 
infinite execution. 

Markov chains associate each state (system configuration) with a probability 
to be in that state during an infinite execution. The fixed probability of the state 
is a “stabilizing” value. It is clear that the probability is either zero or one in 
the first configuration. Given the probability of transitions between configura- 
tions, one can compute the stable probability in an infinite execution, which is 
typically greater than zero and less than one. We found the definition of weak 
stabilization and the use of Markov chains to be an interesting and promising 
way for extending the applicability of the self-stabilizing concept. 
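
As a small illustration of the "stable probability" described above, the following sketch iterates a two-state chain (safe and unsafe) until the distribution stops changing. It is only a sketch under stated assumptions: the transition probabilities are made-up numbers, and the function name `stationary` is hypothetical; neither comes from the paper.

```python
def stationary(P, start, steps=10_000, tol=1e-12):
    """Repeatedly apply dist <- dist * P until successive distributions agree."""
    dist = list(start)
    n = len(P)
    for _ in range(steps):
        nxt = [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, dist)) < tol:
            return nxt
        dist = nxt
    return dist

# Hypothetical chain (assumed numbers): state 0 = "safe", state 1 = "unsafe".
P = [[0.9, 0.1],   # from safe:   stay safe 0.9, become unsafe 0.1
     [0.5, 0.5]]   # from unsafe: recover 0.5,   stay unsafe 0.5

# In the first configuration the probability of each state is zero or one.
pi_safe, pi_unsafe = stationary(P, [0.0, 1.0])
print(round(pi_safe, 4), round(pi_unsafe, 4))
```

For this assumed chain the long-run probability of being safe is 0.5 / (0.1 + 0.5) = 5/6, independent of the starting configuration, which matches the remark that the stable value typically lies strictly between zero and one.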

The remainder of the paper is organized as follows. The next section reviews
Dijkstra’s algorithm and discusses the problem of adapting it to different register
models. Section 3 describes a solution for regular registers. Then in Section 4
we present impossibility results and algorithms for different settings of systems
that use safe registers. Randomized registers and the use of Markov chains are
presented after that, followed by some concluding remarks. Detailed
proofs are omitted from this extended abstract.

2 Adapting Dijkstra’s Algorithm 

For the remainder of the paper, we let DijAlg refer to the first self-stabilizing
algorithm of Dijkstra. This algorithm is expressed by guarded assignment statements and
is based on a ring of n machines that communicate unidirectionally with atomic
(central daemon) execution semantics. Generally, conversion of self-stabilizing
algorithms from one model to another can be difficult. However, the specific
case of DijAlg is not difficult to adapt to register models.

For the register models, each “machine” of DijAlg is replaced by a processor,
which has an internal state represented by variables and a program counter.
Each processor p_i, for 0 ≤ i < n, can read from one or more input registers and
write to one or more output registers (these communication registers are often
called link registers). The unidirectional nature of the ring implies that the set
of output registers for p_i can only be written by p_i and that the input registers
of p_{i+1} (processor subscripting implicitly uses mod n arithmetic) are precisely
the output registers of p_i. One variant of the register model allows p_i to write
its output registers, but prohibits p_i from reading its output registers. In the
literature on registers, this type of link register is called 1W1R, because the
register has exactly one writer p_i and exactly one reader p_{i+1}. We also consider the
case of 1W2R link registers, which allows both p_i and p_{i+1} to read the output
register of p_i.

The control structure of each processor consists of an infinite iteration of a 
fixed list of steps that read input registers, perform local calculations, update 
processor variables, and write to output registers. We call each such iteration of 
a processor a cycle. 

The processor-register model of a system does not specify algorithms using
the guarded assignment notation of Dijkstra. Instead, processors are non-terminating
automata that continuously take steps in any execution, where an execution is a
fair interleaving of processor steps. This means that the direct association of
“privilege” with “enabled guard” in Dijkstra’s presentation does not apply to the
processor-register model. Instead, we use the idea of a token, since each processor
is always enabled to take steps.

We do not present a formal statement of what it means to adapt DijAlg to
other models in this extended abstract. Informally, an adaptation of DijAlg
satisfies the following constraints for each processor: except for the program counter
and temporary variables used to read communication registers or used for local
calculations, there is one counter variable that represents the state of a machine
of DijAlg; processor p_0 plays the role of the exceptional machine of DijAlg by
incrementing its counter (modulo some value K) when the values it reads from
its input registers represent its current counter value, and otherwise its counter
value does not change; and processor p_i, i ≠ 0, plays the role of an
unexceptional machine of DijAlg, changing its counter only if it reads input registers
that represent a value different from its current counter, and then p_i assigns its
counter to be equal to the representation from the input registers. A “token” is
thus equivalent to the conditions that enable a processor to change its counter.
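
The informal constraints above can be made concrete with a minimal sketch of the atomic (central daemon) version of the ring. This is not the register-based adaptation studied in the paper; it only illustrates the counter rules and the notion of a token. The function names, the choice K = n, and the random scheduler are assumptions made for the example.

```python
import random

def has_token(x, i):
    """p_0 holds a token when its counter equals its predecessor's;
    every other processor holds one when the two counters differ."""
    return x[i] == x[i - 1] if i == 0 else x[i] != x[i - 1]

def tokens(x):
    return [i for i in range(len(x)) if has_token(x, i)]

def step(x, i, K):
    """One atomic move of processor i (a no-op if it holds no token)."""
    if has_token(x, i):
        x[i] = (x[i] + 1) % K if i == 0 else x[i - 1]

def run(x, K, steps, rng):
    """Central daemon: activate one token-holding processor at a time."""
    for _ in range(steps):
        step(x, rng.choice(tokens(x)), K)
    return x

rng = random.Random(0)
x = [3, 1, 4, 1, 2]            # arbitrary (possibly illegitimate) initial state
run(x, K=5, steps=1000, rng=rng)
print(x, tokens(x))            # a single token remains
```

Note that `x[i - 1]` with i = 0 reads the last list element, which plays the role of the ring predecessor of p_0.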

In subsequent sections we investigate adaptations of DijAlg using non-atomic
registers. Observe that, since the nature of DijAlg’s legitimate behavior is
single-token circulation in the ring (mutual exclusion), it follows that transferring a
token from one processor to the next is essentially atomic: the token only
moves in one direction and cannot reverse course. Thus the challenge of using
non-atomic registers is to simulate this atomic behavior. A later section weakens this
atomicity for a model of registers that are only probabilistically correct.

3 Regular Registers 

Before we introduce our results for the case of regular registers, let us present
“folklore” results concerning read/write registers. Here and for the remainder of
the paper, we refer to processors by p_1 ... p_n rather than p_0 ... p_{n-1}.

Read/Write Atomicity: It is known that n − 1 labels are sufficient for the
convergence of DijAlg assuming a central daemon, where n is the number of
processors in the ring. We next prove that n − 2 labels are not sufficient.



Lower Bound: Consider the case of n − 2 states in a system of n = 5
processors. Thus there are three possible processor states, which we label {0,1,2}. To
prove impossibility we demonstrate a non-converging sequence of transitions (the
key to constructing the sequence is to maintain all three types of labels in each
system state, which violates the key assumption for the proof of convergence).

{0,0,2,1,0} → {1,0,2,1,0} → {1,0,2,1,1}
→ {1,0,2,2,1} → {1,0,0,2,1} → {1,1,0,2,1}
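
The sequence above can be replayed mechanically. The sketch below is an illustration, not part of the paper: list index 0 stands for the exceptional processor, and the activation order `schedule` is read off the sequence of states. It fires the processors in that order and records the number of tokens after each move.

```python
K = 3  # n - 2 labels for n = 5 processors

def has_token(x, i):
    # Index 0 is the exceptional processor; x[-1] is its ring predecessor.
    return x[0] == x[-1] if i == 0 else x[i] != x[i - 1]

def fire(x, i):
    """Return the state after processor i makes its move (it must hold a token)."""
    assert has_token(x, i)
    y = list(x)
    y[i] = (y[0] + 1) % K if i == 0 else y[i - 1]
    return y

def token_count(x):
    return sum(has_token(x, i) for i in range(len(x)))

start = [0, 0, 2, 1, 0]
schedule = [0, 4, 3, 2, 1]      # activation order used in the sequence above

x, counts = start, []
for step in range(3 * len(schedule)):
    x = fire(x, schedule[step % len(schedule)])
    counts.append(token_count(x))

print(x == start, set(counts))  # the run returns to its start, never with one token
```

Each pass through the schedule adds 1 (mod 3) to every counter, so after three passes the ring is back in its initial state while four tokens persist throughout.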

We now present a reduction of a ring with 2n processors that is
activated by a central daemon to a ring with n processors that assumes read/write
atomicity. We conclude that at least 2n − 1 states are required.

Each processor p_j has an internal variable in which p_j stores the value p_j
reads from p_{j-1}. Each read is a copy to an internal variable and each write is a
copy of an internal variable to a register. Thus, we have in fact a ring of 2n
processors in a system with a central daemon. Hence, 2n − 1 states are required and
are sufficient.



Regular Registers: We now turn to the design of an algorithm for the case
of regular registers. Informally, a regular register has the property that a read
operation concurrent with a write operation can return either the “old” or “new”
value. More formally, to define a regular register r we need to define the possible
values that a read operation from r returns. Let x^0 be the value of the last write
operation to r that ends prior to the beginning of the read operation (let x^0 be
the initial value of r if no such write exists).
A read operation from a regular register r that is not executed concurrently
with a write operation to r returns x^0. A read operation from a regular register
r that is executed concurrently with a write of a value x^1 returns either x^0 or x^1.
Note that more generally, a read concurrent with a sequence of write operations
of the values x^1, ..., x^m to r could return any x^k, 0 ≤ k ≤ m; however, once
a read returns x^k for k > 0, no subsequent read by the same reader will return
x^j for j < k − 1 (for two successive read operations, there is at most one write
operation concurrent with both).
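
To make the definition tangible, here is a minimal single-writer model of a regular register. It is a sketch under simplifying assumptions that are not in the paper: reads are treated as instantaneous (so a read overlaps at most one write), and the class and method names are hypothetical.

```python
import random

class RegularRegister:
    """Toy model: a read that overlaps a write may return the old or the new value."""

    def __init__(self, initial, rng=None):
        self.old = initial        # x^0: value of the last completed write
        self.pending = None       # x^1: value of the write in progress, if any
        self.writing = False
        self.rng = rng or random.Random()

    def begin_write(self, value):
        self.pending, self.writing = value, True

    def end_write(self):
        self.old, self.writing = self.pending, False

    def read(self):
        if not self.writing:
            return self.old                               # no concurrent write: x^0
        return self.rng.choice([self.old, self.pending])  # concurrent: x^0 or x^1

r = RegularRegister(0, random.Random(1))
r.begin_write(1)
print([r.read() for _ in range(8)])   # a mix of old and new values is allowed
r.end_write()
print(r.read())                       # after the write completes, only the new value
```

The model also permits a later read to return the old value after an earlier read returned the new one, which regular (as opposed to atomic) registers allow.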



s_1  s_2  s_3
 0    0    0    p_1 starts to write 1
 1    0    0    p_1 still writing 1, and p_2 reads 1
 1    1    0    p_1 still writing 1, and p_2 writes 1
 1    1    0    p_1 still writing 1, p_2 writes 1, and p_3 reads 1
 1    1    0    p_1 still writing 1, p_2 reads 0
 1    0    1    p_2 writes 0, and p_3 writes 1
 1    0    1    p_1 reads 1, p_2 reads 1, and p_3 reads 0



Fig. 1. Straightforward regular register implementation fails. 



A naive implementation of DijAlg using regular registers may result in the
execution presented in Figure 1. The figure shows states of a three-processor
system. The system starts in a safe configuration in which all the values (in the
registers and the internal variables) are 0, and it reaches a configuration
in which all the processors may simultaneously change state.

To overcome the above difficulty we introduce a new value T that is written
before any change of a value of a register (the domain of register values thus
includes all counter values and the new value T). The algorithm for the case of
regular registers appears in Figure 2. In the figure, IR_i is the input register for
p_i (thus IR_i is the output register of p_{i-1}). Variable x_i contains the counter from
DijAlg, and variable t_i is introduced to emphasize the fine-grained atomicity of
the model (one step reads a register, and the value it returns is tested in another
step). If processor p_i reads a value T from an input register, it ignores the value
(see line 10 of the protocol).

A safe configuration is a configuration in which all the registers have the
same value, say x, and every read operation that has already started will return
x. For simplicity we assume there are 2n + 1 states. Therefore, it is clear that
a state is missing in the initial configuration, say the state y. Hence, when p_1
writes y, p_1 does not change its state before reading y from p_n. p_n can read y
only when p_{n-1} has the state y. Any read operation of p_{n-1} that starts following
the write operation that assigns y to p_{n-1} may return either T or y, which is
effectively y (see lines 3 to 6 and 10 to 13 of the code).

 1  p_1:           do forever
 2                   read t_1 := IR_1
 3                   if t_1 = x_1 then
 4                     x_1 := (x_1 + 1) mod (2n + 1)
 5                     write IR_2 := T
 6                     write IR_2 := x_1
 7                   else write IR_2 := x_1
 8  p_i (i ≠ 1):   do forever
 9                   read t_i := IR_i
10                   if x_i ≠ t_i ∧ t_i ≠ T then
11                     x_i := t_i
12                     write IR_{i+1} := T
13                     write IR_{i+1} := x_i
14                   else write IR_{i+1} := x_i

Fig. 2. DijAlg for Regular Registers.



4 Safe Registers 

Safe registers have the weakest properties of any in Lamport’s hierarchy. A read
concurrent with a write to a safe register can return any value in the register’s
domain, even if the value being written is already equal to what the register
contains. There are two cases to consider for the model of safe registers. If a
processor is unable to read the register(s) that it writes, we can show that DijAlg
cannot be implemented. We initially consider the model of a single link register
for each processor under the restriction that a writer is unable to read its output
registers, that is, the model of 1W1R registers.

The known construction of a regular 1W1R boolean register from a safe boolean
1W1R register is simple: the writer skips actually writing to the
register if the new value is the same as the value already in the register. This is
possible because the writer keeps an internal variable equal to the current value
of the register. However, this technique is not self-stabilizing because the initial
value of the writer’s internal variable may not be correct.

Lemma 1. DijAlg cannot be adapted by using only a single 1W1R safe register
between p_i and p_{i+1}.

Proof. Processor p_i (i ≠ 1) that copies from the output register of p_{i-1} must
continually rewrite its output register for p_{i+1}; otherwise there can be a
deadlock where the value written by p_i is different from the value p_i reads from p_{i-1}.
Similarly, p_1 must repeatedly write, otherwise there can be a deadlock where
the value written by p_1 is the same as the value p_1 reads from p_n. Therefore,
processors continually write into their output registers. Since all processors
repeatedly write their output registers, we can construct an execution where reads
are concurrent with writes and obtain arbitrary values. This construction can be
used to show that the protocol does not converge (and also that it is not stable).

Multiple Fields Safe Register: Lemma 1 can be generalized, and we sketch
the argument here. We consider the case of multiple safe registers per processor,
but where processors cannot read the registers they write. Suppose each
processor p_i has m safe registers to write, which p_{i+1} reads, and also p_i reads m
safe registers written by p_{i-1}. If a protocol allows a state in which a processor
does not write any of its registers so long as its state does not change, then we
may construct a deadlock because the local state of the processor differs from
the encoding of values contained in its output registers. Therefore, in any
implementation of the protocol, we can construct an execution fragment so that
any chosen processor p_i writes at least some of its registers t times, for arbitrary
t > 0, and during the same execution fragment, p_{i-1} takes no steps. Moreover, if
p_i does not write to all m registers, then the registers it does not write can have
arbitrary values inherited from the initial state. Therefore, p_{i+1} can read any
value from p_i, since at each step of p_{i+1} reading one of the m registers written
by p_i, we can construct an execution in which p_i is concurrently writing to the
same safe register. Because p_{i+1} can read any value, it is possible for i ≠ n
that p_{i+1} reads a value equal to its own current value, which for DijAlg means
that p_{i+1} will maintain its current value rather than changing it; for the case
i = n, there is an execution where each time p_1 reads its input registers, the
value read differs from its own value, and again p_1 makes no change to its
current value. These situations can repeat indefinitely with no processor entering
the critical section.



Composite Safe Register: Next we sketch a solution in which fields of the
registers can be written and the entire register is read at once. We call such a
register a composite safe register. A read from a composite safe register may return
an arbitrary value for at most one of the register fields, a field in which a write
is executed concurrently with the read.* We note that there is a natural extension
of our algorithm in which at most k fields of a register may return an arbitrary
value.

Each bit of the label value is stored in three 1-bit safe registers (three fields) . 
This will ensure that a read during a refresh operation will return the value of 
the register. Assume that the value 101 is stored in nine 1-bit safe registers as 
111000111. Assume further that a processor refreshes the value written in these 
registers each time writing in one of the 1-bit safe registers. A read operation 
returns the value of the entire composite safe register in which at most one bit 
is wrong. The Hamming distance ensures that the original value of the label bit 
can be determined. 
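
The triplication argument can be checked directly. The sketch below is an illustration only (the helper names `encode` and `decode` are hypothetical): each label bit is stored in three 1-bit fields, and a majority vote per group recovers the label when at most one field of the whole register is read incorrectly.

```python
def encode(label_bits):
    """Store every label bit in three 1-bit fields, e.g. '101' -> '111000111'."""
    return "".join(b * 3 for b in label_bits)

def decode(fields):
    """Majority vote inside each group of three fields."""
    groups = [fields[i:i + 3] for i in range(0, len(fields), 3)]
    return "".join("1" if g.count("1") >= 2 else "0" for g in groups)

def flip(fields, pos):
    """Model a read in which the single field at index pos is returned incorrectly."""
    wrong = "0" if fields[pos] == "1" else "1"
    return fields[:pos] + wrong + fields[pos + 1:]

stored = encode("101")
print(stored)                          # 111000111
print(decode(flip(stored, 4)))         # 101, despite one corrupted field
```

The two codewords for a single bit, 000 and 111, are at Hamming distance three, which is why one wrong field per read cannot change the decoded value.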

To allow a value change we add a three-bit guard value. Hence, the
composite safe register has three bits that function as a guard value and 3 × 2(n + 1)
bits for the label.

A processor p_i, i ≠ 1, that reads a new value from p_{i-1} first sets the guard
value to 0 (writing 000 in the guard bits), and then changes the value of the
label. p_i writes 111 to its guard bits once p_i finishes updating the label.

* This assumption reflects reality in systems in which a read operation is much faster
than a write operation.

A processor p_i that reads a guard value 0 does not use the value read. When
p_i reads a guard value 1, it examines the value it read.

The correctness proof starts by observing that after the first time
a processor p_i refreshes (or writes a new value in) its register, any read operation
from its register (that returns a value) results in the last value written to this
register. p_1 eventually writes a label that does not exist elsewhere in the system,
and this label cleans the system.
More details are omitted from this extended abstract.



Safe Registers with Reads instead of Refreshes: Given the above
impossibility results, we examine settings where a processor can read the contents of
the registers in which it writes. Consider 2n + 1 single-bit, safe, 1W2R registers
rather than a single register per processor. Each processor maintains a counter
with domain [1, 2n + 1] for DijAlg. Unary encoding represents this counter: for
a counter value k, the proper encoding is to write all registers 0 except for the
register with index k, which has value 1. Figure 3 gives an adaptation of DijAlg
for this 1W2R model. Lines 2-7 are concerned with correcting the values of
registers to agree with internal counter values, but such writing is only done where
needed. This correction is for the case of incorrect values in illegitimate initial
configurations. However, for convenience, we let lines 2-7 also do the work of
transmitting a new counter value (since lines 2-7 are repeated after lines 8-13
in the execution of each processor).



 1  p_1:           do forever
 2                   do k := 1 to 2n + 1
 3                     read t_1 := IR_2[k]
 4                     if k ≠ x_1 ∧ t_1 = 1
 5                       write IR_2[k] := 0
 6                     if k = x_1 ∧ t_1 = 0
 7                       write IR_2[k] := 1
 8                   do s := 0, j := 0, k := 1 to 2n + 1
 9                     read t_1 := IR_1[k]
10                     if t_1 = 1
11                       s := s + 1; j := k
12                   if s = 1 ∧ j = x_1
13                     x_1 := 1 + (x_1 mod (2n + 1))

 1  p_i (i ≠ 1):   do forever
 2-11                (similar to code for p_1)
12                   if s = 1 ∧ j ≠ x_i
13                     x_i := j

Fig. 3. DijAlg adaptation for Safe Registers.

A legitimate configuration for this protocol is that each register vector
represents the processor’s last counter value (it differs only when a processor updates
its counter) and counters correspond to DijAlg.

Lemma 2. Figure 3 is a self-stabilizing adaptation of DijAlg.

Proof. There are two proof obligations, stability (closure) from legitimate con- 
figurations and convergence from arbitrary configurations to legitimate ones. 



Closure. It is straightforward to verify that in any processor cycle from a legit- 
imate configuration, a processor writes to at most two registers as it changes the 
counter value. Thus when the neighbor reads these registers, at most two reads 
can have incorrect values due to concurrent writing. If both have correct values, 
the token passes correctly (a subsequent read by the process can still obtain an 
incorrect value, but only by getting 0 for all reads, which causes no harm). If 
both have incorrect values, then the reader observes no change in counter values. 
If just one returns an incorrect value, then the reader observes parity of zero, 
which is harmless. This reasoning shows that the protocol is stable. 



Convergence. The remaining task is to verify that the protocol guarantees to
reach a legitimate configuration in any execution. Suppose all processors have
completed at least one cycle of statements 1-13. In the subsequent execution,
a processor only writes a register if that register requires change to agree with
the processor’s counter. Note that by standard arguments, no deadlock is
possible in this system and that p_1 increments its counter infinitely many times
in an execution. It is still possible that one processor can read more than two
incorrect values due to concurrent writes (consider an initial state with many
counter values; as these values are propagated to some p_i, it could be that p_{i+1}
happens to read many registers concurrent with p_i writing to them). Since the
counter range is [1, 2n + 1] and there are n processors, it follows that at least
one counter value t is not present in the system. By the arguments given for the
proof of closure, no processor incorrectly reads input registers to get the value
t in such a configuration. Because p_1 increments x_1 infinitely often, we can suppose
x_1 = t but no other processor or register encoding equals t, and by standard
arguments (and the propagation of values observed in the proof of closure), a
legitimate configuration eventually is reached.

A standard construction of 1W2R registers from 1W1R registers is just to
allocate an added new output register and arrange for the writer to ensure the
outputs are duplicates. The reader may wonder if such a construction
contradicts Lemma 1 and its generalization. There are two interesting cases to examine.
First, we reject any protocol where p_{i+1} has a new output register that p_i reads,
since such an adaptation of DijAlg would violate the unidirectional model
constraint. Second, adding a new 1W1R register written by p_i and read only by p_i
would be equivalent to an internal variable, which we have already considered
in the proof of Lemma 1. Thus, under our constraints on adaptation of DijAlg,
the models of 1W1R and 1W2R registers differ in capability.

The protocol of Figure 3 uses an expensive encoding of counter values,
requiring 2n + 1 separate registers. The argument for closure shows that changing
a counter and transmitting it is effectively an atomic transfer of the value:
once the new value is observed, then any subsequent read of the registers either
returns the new value or some invalid value (where the sum of bits does not
equal 1), which is ignored. Note that this technique is not an implementation
of an atomic register from safe registers; it is specific to the implementation of
DijAlg.
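
The validity rule for the unary encoding (a scan is used only when exactly one bit is set) can be illustrated as follows. This is a sketch with hypothetical helper names, and it models only the intermediate register contents between the two writes of a counter change, not the arbitrary values a safe register may return during a concurrent write.

```python
def encode_unary(k, size):
    """Registers 1..size hold 0 everywhere except a 1 at index k."""
    return [1 if idx == k else 0 for idx in range(1, size + 1)]

def decode_unary(bits):
    """Return the counter value, or None when the scan is invalid (sum of bits != 1)."""
    if sum(bits) != 1:
        return None
    return bits.index(1) + 1

n = 3
size = 2 * n + 1
old = encode_unary(2, size)

# Changing the counter from 2 to 3 takes two single-bit writes. Whichever
# write happens first, the intermediate contents do not decode to a counter.
cleared_first = list(old); cleared_first[2 - 1] = 0   # old bit cleared, new bit not yet set
set_first = list(old);     set_first[3 - 1] = 1       # new bit set, old bit not yet cleared

print(decode_unary(old), decode_unary(cleared_first), decode_unary(set_first))
print(decode_unary(encode_unary(3, size)))
```

A reader therefore sees the old value, an ignored invalid scan, or the new value, which is the sense in which the transfer is effectively atomic.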

Can we do better than using 2n + 1 registers? The following protocol uses
the Gray code representation of the counter, plus an extra bit for parity. The
number of registers per processor is m + 1, where m = ⌈lg(2n + 1)⌉.
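
For readers who want to experiment with the encoding, here is a minimal sketch of the reflected binary Gray code with a parity bit. The function names are hypothetical, and the loop checks only non-wrapping increments (x to x + 1 below the modulus); the wrap-around from 2n back to 0 is not covered by this sketch.

```python
def gray(x):
    """Reflected binary Gray code of a non-negative integer."""
    return x ^ (x >> 1)

def parity(g):
    """Parity (0 or 1) of the number of 1 bits in g."""
    return bin(g).count("1") % 2

def registers_per_processor(n):
    """m + 1 registers, with m = ceil(lg(2n + 1)) computed without floating point."""
    m = (2 * n).bit_length()
    return m + 1

n = 3
for x in range(2 * n):
    changed = gray(x) ^ gray(x + 1)
    assert bin(changed).count("1") == 1               # exactly one Gray code bit changes
    assert parity(gray(x)) != parity(gray(x + 1))     # so the parity bit changes too

print(registers_per_processor(n))                     # 4 registers instead of 2n + 1 = 7
print([format(gray(x), "03b") for x in range(2 * n + 1)])
```

An increment thus touches two registers, one Gray code bit and the parity bit, which is the two-bit change that the closure argument for this protocol inspects.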



 1  p_1:           do forever
 2                   do k := 1 to m
 3                     read t_1 := IR_2[k]
 4                     if t_1 ≠ Graycode(x_1)[k]
 5                       write IR_2[k] := Graycode(x_1)[k]
 6                   read t_1 := IR_2[m + 1]
 7                   if t_1 ≠ parity(Graycode(x_1))
 8                     write IR_2[m + 1] := parity(Graycode(x_1))
 9                   do k := 1 to m
10                     read g_1[k] := IR_1[k]
11                   read t_1 := IR_1[m + 1]
12                   if t_1 = parity(g_1) ∧ Graycode(x_1) = g_1
13                     x_1 := (x_1 + 1) mod (2n + 1)

 1  p_i (i ≠ 1):   do forever
 2-11                (similar to code for p_1)
12                   if t_i = parity(g_i) ∧ Graycode(x_i) ≠ g_i
13                     x_i := g_i

Fig. 4. DijAlg using Gray Code.



Lemma 3. Figure 4 is a self-stabilizing adaptation of DijAlg.

Proof. The closure argument is the same as given in the proof of Lemma 2,
inspecting each of the four cases of reading overlapping with writing of the two
bits that change when a processor changes its counter and writes the one new
Gray code bit and the parity bit. In each case, the neighbor processor either
reads the old value, or ignores the values it reads (because parity is incorrect),
or obtains the new counter value. The change from old to new counter value is
essentially atomic.

Proof of convergence requires new arguments. Consider some configuration 
of an execution prior to which each processor has completed at least two cycles 
of statements 1-13 in Figure 4, so that output registers agree with counter values 
(unless the processor has read a new value and updated its counter). Observe 
that thereafter, if processor p_i successively reads two different Gray code values 
from its input registers, each with correct parity, then p_{i-1} concurrently wrote 
at least once to its output registers. Moreover, if p_i successively reads k different 
Gray code values with correct parity, then p_{i-1} wrote at least k - 1 times a 
new counter value and read at least k - 1 times from its own input registers 
(in turn, written by p_{i-2}). A consequence of these observations is that if p_1 
successively reads k different counter values with correct parity, then p_{n-k} wrote 
at least one new counter value in the same period. In particular, if p_1 
successively reads n + 2 different counter values, then we may assert that p_2 read p_1's 
output registers and wrote a new counter in the same period. By the standard 
argument refuting deadlock, processor p_1 increments its counter infinitely often 
in any execution. Therefore we can consider an execution suffix starting with 
x_1 = 0. In the reflected Gray code the high-order bit starting from x_1 = 0 
does not change until the counter has incremented 2^{m-1} times. Therefore, until p_1 
has incremented x_1 at least 2^{m-1} times, any read by p_2 obtains a value with zero 
in the high-order bit. The observations above imply that, before x_1 changes at 
the high-order bit, each processor has copied some counter value that originated 
via p_1. Such counter values may be inaccurate due to reads overlapping writes 
or more than one write (bit change) for one scan of a set of registers; however, 
the value for the high-order bit stabilizes to zero in this execution fragment. In 
a configuration where no counter or register set has 1 in the high-order bit, the 
event of p_1 changing the high-order bit creates a unique occurrence of 1 in that 
position. Since p_1 does not again change its counter until observing the same 
value from p_n, convergence is guaranteed. 



5 Randomized State Reads and Weak Stabilization 

Consider a system with a probabilistic central daemon: in any given configuration 
the daemon activates each of the processors with equal probability. A system is 
weakly stabilizing if, in any execution, the probability that the system remains 
in any set of illegitimate configurations is zero. This definition implies that a 
weakly stabilizing system has the property that its state is infinitely often legit- 
imate. In addition, one can sum up the probabilities for being in a legitimate 
state and use this value to compare algorithms. 

To apply the definition of weak stabilization, we model register behavior 
probabilistically: a processor that makes a transition may “read” an incorrect 
value and therefore make an errant transition. When the daemon selects a pro- 
cessor, we consider the transition by that processor to be one full cycle of reading 
its input and writing its output register atomically. We use Markov chains to 
analyze the stationary probabilities, i.e., to determine the probability that the 
system will be in a legitimate configuration. See m for a description of Markov 
chains. 

We continue describing our approach using a system of three processors and 
two states. A read of a neighboring state returns the correct value with 
probability p, and the complement of the correct value with probability 1 - p. There 
are eight states for this system; Fig. 5 shows the transition diagram for this 
system. We explain the figure by considering the configuration 111. With 
probability 1/3, the daemon activates p_1, and p_1 reads with probability 1 - p the 
incorrect value 0, which is ignored by DijAlg; thus the joint probability for this 
transition from 111 to 111 is (1 - p)/3. Two other transitions from 111 to 111 
are also possible: either p_2 or p_3 could read correctly with probability p, thus 
the probability for each of these transitions is p/3. The sum of all cases for this 
transition is therefore (1 - p)/3 + 2p/3 = (1 + p)/3. To simplify the presentation, 
every arc label omits the common factor 1/3, so the 111 to 111 arc in Fig. 5 
is labeled (1 + p). Similar case analysis derives probabilities for the other arcs 
shown in the figure. Each configuration has four outgoing arrows, one arrow for 
each state change of a processor, and one for staying in the same state. 

Fig. 5. Transition Probabilities (factorized by 3). 

We now choose specific values for p and compute powers of the transition 
probability matrix P until the matrix in power i and in power i + 1 are equal 
(P^i = P^{i+1}). Each entry P_{jk} of the 8x8 transition probability matrix P 
contains the probability for a transition from configuration j to configuration k. The 
equilibrium probability of being in a legal configuration (i.e., not in the 
configurations 010 or 101) is then derived from P^i. The following table shows different 
values for p (1, 3/4, 1/2, 1/4) and the corresponding equilibrium vector; two 
figures display the transition matrix P for the cases of p = 1 and p = 3/4. 



Matrix   p     Equilibrium Vector
Fig. 6   1     [1/6, 1/6, 0, 1/6, 1/6, 0, 1/6, 1/6]
Fig. 7   3/4   [7/48, 7/48, 1/16, 7/48, 7/48, 1/16, 7/48, 7/48]
         1/2   [1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8]
         1/4   [1/12, 1/12, 1/4, 1/12, 1/12, 1/4, 1/12, 1/12]



The vectors show that the equilibrium probability for illegitimate 
configurations is zero for the deterministic case (p = 1) and increases as p decreases. 
Clearly, we can investigate the behavior of other systems with a range of 
probabilities, using the same approach. The results can assist us in comparing 
different system designs. 
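
The Markov-chain analysis can be reproduced with a short script. The sketch below is one possible reading of the model, not the authors' code. It assumes that p_1 increments its bit when the value it reads from p_3 equals its own, that p_2 and p_3 copy the value read from their predecessor when it differs from their own, and that configurations are ordered 000, 001, ..., 111. The names `transition_matrix` and `equilibrium` are illustrative.

```python
from itertools import product


def transition_matrix(p, n=3):
    """Build the 2^n x 2^n matrix for the two-state ring with read accuracy p."""
    states = list(product((0, 1), repeat=n))
    index = {s: k for k, s in enumerate(states)}
    P = [[0.0] * len(states) for _ in states]
    for s in states:
        for i in range(n):                       # daemon picks each processor w.p. 1/n
            pred = s[i - 1]                      # p_1 (index 0) reads the last processor
            for read, pr in ((pred, p), (1 - pred, 1 - p)):
                t = list(s)
                if i == 0:
                    if read == s[0]:             # distinguished processor increments
                        t[0] = 1 - s[0]
                elif read != s[i]:               # other processors copy the value read
                    t[i] = read
                P[index[s]][index[tuple(t)]] += pr / n
    return states, P


def equilibrium(P, iterations=5000):
    """Approximate the stationary vector by repeated vector-matrix products."""
    v = [1.0 / len(P)] * len(P)
    for _ in range(iterations):
        v = [sum(v[j] * P[j][k] for j in range(len(P))) for k in range(len(P))]
    return v


if __name__ == "__main__":
    for p in (1.0, 0.5, 0.25):
        _, P = transition_matrix(p)
        print(p, [round(x, 4) for x in equilibrium(P)])
```

Under these assumptions the script gives the tabulated vectors for p = 1, 1/2 and 1/4, and the entry for the arc from 111 to 111 equals (1 + p)/3. For p = 3/4 it yields 3/20 for each legitimate configuration and 1/20 for 010 and 101, which differs slightly from the tabulated row, so the model assumed here may not coincide in every detail with the one behind Fig. 7.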

Lemma 4. The adaptation of DijAlg is weakly stabilizing when register reads 
are correct with probability p > 0. 

Fig. 6. Transition Matrix for p = 1. 



6 Conclusion 

This paper presents a number of results related to weakening the model of atomic 
link registers for the unidirectional ring model associated with DijAlg. Our results 
are both positive (the constructions for regular and safe registers) and negative 
(the impossibility for 1W1R safe registers). Some similar experiences with the 
difficulties of self-stabilizing register constructions are reported in m, however 
the problem adapting DijAlg has additional constraints and also advantages: the 
constraint of unidirectional communication rules out certain techniques, but the 
ring topology does provide sufficient feedback (eventually information flows from 
p_i back to p_{i-1}) to make constructions possible. 

Fig. 7. Transition Matrix for p = 3/4 factorized by 3. 

The introduction of probabilistic registers motivates a weakened form of 
self-stabilization. (Another model of probabilistically correct registers appears 
in m-) Our model of probabilistically correct registers could be overly pes- 
simistic: it could be interesting to refine the model so that reads are always 
correct if there is no concurrent write, but only correct with probability p when 
a read overlaps a write. The weakly stabilizing protocol does not guarantee clo- 
sure, since there is always the possibility of a read returning an incorrect value. 
The usefulness of this kind of weakening of stability is typically associated with 
control theory, e.g., in the goal is to find systems such that all trajectories 
visit the “good” states infinitely often. Though rarely formalized in the litera- 
ture of self-stabilization, this reasoning is implicit in many papers: each new fault 
perturbs the system, and provided the execution is fault-free for long enough, 
the system returns to a legitimate state. Our probabilistic model quantifies both 
fault and scheduler probabilities to characterize executions with a Markov model. 
Abstract. This paper presents the first algorithm for implementing self-stabilizing 
group communication services in an asynchronous system. 
Our algorithm converges rapidly to a legal behavior and is communi- 
cation adaptive. Namely, the communication volume is high when the 
system recovers from the occurrence of faults and is low once a legal 
state is reached. The communication adaptability is achieved by a new 
technique that combines transient fault detectors. 



1 Introduction 

Group communication services are becoming widely accepted as useful build- 
ing blocks for the construction of fault-tolerant distributed systems and com- 
munication networks. Designing robust distributed systems is one of the most 
important goals in the research of distributed computing. One way to simplify 
the design and programming process of a distributed system is to design a useful 
set of high-level programming primitives that forms a robust set of tools. A 
group communication system enables processors that share a collective interest 
to identify themselves as a single logical communication endpoint. Each such 
endpoint is called a group, and is uniquely identified. A processor may join 
or leave a group by issuing a join/leave request to the group 
membership service. The membership service reports membership changes to the 
members, by delivering views. A view of a group includes a set of members and 
a unique identifier. A processor may send a message to a group, using a group 
multicast service. 

A very important property in the implementation of the primitives of group 
communication services is its fault tolerance and robustness. It is assumed that 
processors leave and join the group either voluntarily or due to crashes or re- 
coveries. The distributed algorithms that implement these services assume a 
particular set of possible failures, such as crash failures, link failures or message 
loss. The implementing algorithms should provide the specified services in spite 
of the occurrence of these failures. The correctness of the implementing algo- 
rithms is proved by assuming a predefined initial state and considering every 
possible execution that involves the assumed possible set of failures. 

* Dolev’s work was supported by BGU seed grant. 
The abstraction that limits the possible set of failures is convenient for 
establishing correctness of the algorithm, but it can at the same time be too 
restrictive. Group communication service is a long-lived on-line task and hence 
it is very hard to predict in advance the exact set of faults that may occur. 
Therefore, it may be the case that due to the occurrence of an unexpected fault, 
the system reaches a state that is not attainable from the initial state by the pre- 
defined transitions (including the predefined fault occurrences). Self-stabilizing 
systems (mini can be started in any arbitrary state and exhibit the desired 
behavior after a convergence period. Thus, self-stabilizing group communication 
services can automatically recover following the occurrence of such unexpected 
faults. 

Related Work: The Isis system |S| is the first implementation of group commu- 
nication services that triggered researchers and developers to further examine 
such services. Cristian 0| formalized a definition of group communication for 
synchronous systems. Group communication services were implemented with dif- 
ferent guarantees for reliability and message delivery order. For example, Isis |S|, 
Transis Totem Newtop Relacs |3, Horus m and Ensemble m- 
None of the above implementations is self-stabilizing. A specification that 
guarantees performance once the system stabilizes to satisfy certain properties is 
presented in m- This is a consequence of existing impossibility results for re- 
quirements that hold in all possible executions e.g., m- Still it is assumed in m 
that the system is started in a certain global state, and the transitions are from 
a predefined set of transitions — thus the specification and algorithm presented 
in [20] are not designed for self-stabilizing systems. 

A different approach (part of which is randomized) is used in m- Every 
processor periodically transmits a list of the processors that it can directly 
communicate with. A processor is considered “up” and connected as long as it can 
successfully transmit a “fresh” time-stamp; otherwise it will eventually be 
discarded from the system. The algorithm presented in [22] may be a base for a 
self-stabilizing algorithm, if for example, each processor has access to a local 
pulse generator, such that the maximum drift between the pulse generators is 
negligible. 

Congress [2] is an elegant protocol for registration of membership 
information at (hierarchically organized) servers. A hierarchy of servers improves 
scalability. Users send a message to servers with join or leave requests. The servers 
maintain the membership information. The design fits wide area network using 
virtual links to define neighboring relation. 

Moshe [21] is a group membership service implementation that considers an 
abstract network service (such as Congress [2]). The network service monitors the 
up and connected members of every group and delivers multicast messages to the 
members of a group. The common cases of membership changes (joins/leaves) 
are considered in order to achieve scalability. The group membership algorithm 
of Moshe uses unbounded counters. 

A self-stabilizing group membership service for synchronous systems is con- 
sidered in PI . A common periodic signal initiates a broadcast of local topology 
of every processor. Every processor uses the local topologies in order to compute 
the connected component to which it belongs. Unbounded signal numbers are 
used and changes in the group are discovered following a common signal. 

Our Contribution: This paper presents the first algorithm for implementing a 
self-stabilizing group membership service in asynchronous systems. We assume 
that processors eventually know the set of non-crashed processors with which 
they can communicate directly. We show that once every processor knows the 
correct set, the membership task is achieved within the order of the diameter of 
the communication graph. Moreover, the activity of the processors is according 
to the required group membership service. Our algorithm converges rapidly to a 
legal behavior and is communication adaptive. Namely, the communication vol- 
ume is high when the system recovers from the occurrence of faults and is low 
once a legal state is reached. The communication adaptability is achieved by a 
new technique that combines transient fault detectors. Furthermore, randomized 
techniques can be used to dramatically reduce the communication complexity of 
the deterministic transient failure detector. 

Our group membership service can be extended to implement different levels 
of broadcast services, such as single-source FIFO; totally ordered; and causally 
ordered. 

The rest of the paper is organized as follows. The system settings appear in 
Section 2. Our algorithms for implementing a self-stabilizing group membership 
service appear in Section 3. Concluding remarks are in the final section. The proofs are 
omitted from this extended abstract, more details can be found in m- 

2 The System 

The distributed system consists of a set P of communicating entities. We call 
each entity a processor, and assume that 1 < |P| = n ≤ N, where N is an upper 
bound on the number of processors. The processors may represent a network of 
real physical CPUs, or correspond to abstract entities like processes or threads in 
a timesharing system. Processors are connected by communication links through 
which they communicate by exchanging messages. The set neighbors_i contains the 
processors that processor p_i can directly communicate with. The communication 
link may represent a (real physical) communication channel device attached to 
the processor, a virtual link, or any inter-process communication facility (e.g., 
UDP or TCP connections). 

It is convenient to represent a distributed system by a communication graph 
G = (V, E), where each node represents a processor and each edge represents a 
communication link. For p_i, p_j ∈ P, p_j ∈ neighbors_i iff (p_i, p_j) ∈ E. 

The system is asynchronous. We assume however that processors eventually 
identify the crashed/non-crashed status of their attached links and neighbors. 
We sometimes use the term time-out in the code of the processors for a repeated 
action of the processors. In fact, a zero time-out period will result in the desired 
behavior as well. The time-out period may only reduce the number of messages 
sent when processors have access to a time device. 

A state machine models each processor. The communication links are 
modeled by two anti-directed FIFO queues. We use a (randomized) self-stabilizing 
data-link algorithm on every link. The existence of the self-stabilizing 
data-link algorithm ensures that when a message is sent it arrives at its destination 
before the next message is sent. Thus, input communication buffers or 
communication registers (when buffers contain at most one message and when the 
content of an arriving message replaces the previous content of the buffer) can 
be assumed whenever it is convenient instead of message passing. 

The system configuration is a vector of the states of the processors and the 
values of the queues (the messages in the queues). 

A communication operation is an operation in which a message is sent or a 
message is received. We also allow a processor to send the same message to every 
one of its neighbors in a single communication operation. A step of a processor 
consists of internal computations that are followed by a single communication 
operation. A system execution is an alternating sequence of configurations and 
(atomic) steps. 

Processors may crash and recover during the execution. The neighbors of a 
crashed processor eventually identify the fact that it is crashed. 

The program of a processor used here consists of a do-forever loop that 
includes a communication step with every neighboring processor. Let R be an 
execution and let A be a connected component of the system such that no processor 
in A is crashed during R. The first asynchronous cycle of A in R is the minimal 
prefix of R such that each processor p_i in A communicates with each of its 
neighbors: at least one message m_j is sent by p_i to every neighbor p_j, such that 
p_j receives m_j during the asynchronous cycle. 

The number of messages sent over a particular communication link during an 
asynchronous cycle is a function of the number of loop iterations the attached 
processors execute during this asynchronous cycle (note that a processor may 
execute any number of iterations before another processor completes a single 
loop iteration). Thus we consider a special execution to measure the communi- 
cation complexity of an algorithm. A very fair execution is an execution in which 
every processor executes exactly a single iteration of its do-forever loop in every 
asynchronous cycle. The communication complexity is the total number of bits 
communicated over the communication links in a single asynchronous cycle of a 
very fair execution. 

The set of legal executions includes all the executions that exhibit the desired 
behavior (input output relation) of the system for a task r. For example, if r is 
the mutual exclusion task, then at most one processor is executing the critical 
section in any configuration of a legal execution. 

A safe configuration of the system is a configuration from which only legal 
executions, with respect to r, start. 

In this paper, the requirements are related to the eventual behavior of the 
system when the execution fulfills certain properties (unlike the requirements 
discussed in cni see also |23). We require that a self-stabilizing algorithm for 
group communication service will reach a safe configuration within a certain 
number of asynchronous cycles in any execution (that starts in an arbitrary 
configuration) such that each processor p_i has a fixed set of non-crashed neighbors 
during the execution.¹ 

We allow simultaneous existence of several groups. We do not, however, 
consider interaction between the groups. Therefore, we choose a specific group g 
for describing the membership service. A boolean variable member_i (logically) 
represents the intention of p_i to be included in g. A partition of the network may 
cause a “partition” of g as well. Therefore, we associate the set of legal 
executions for the group membership task with the processors of a (fixed) connected 
component A, and include every execution R such that the following properties hold: 

1. If the value of member_i = true (member_i = false) is fixed during R then 
there exists a suffix of R in which p_i appears (does not appear, respectively) 
in all the views of group g in the connected component A. 

2. If the value of member_i of every processor p_i of group g in the connected 
component A is fixed during R then there exists a suffix in which all the 
views of group g in connected component A are identical: the views have the 
same list of members and the same view identifier. 

We note that, with our algorithms, the length of the prefix of R that precedes 
the suffix mentioned in the above requirements is O(d) (which is the fastest 
possible). 

The communication of a self-stabilizing algorithm is adaptive if the maximal 
communication complexity after reaching a safe configuration is smaller than the 
maximal communication complexity before reaching a safe configuration. 

3 Self-Stabilizing Group Membership Service 

In this section we present the first communication adaptive self-stabilizing al- 
gorithm for the membership service. Roughly speaking, a spanning tree of the 
system is constructed. This tree is used to execute the membership management 
tasks. The root of the tree is responsible for the management of the membership 
requests, and establishing new views. Several transient fault detectors monitor 
the consistency of the tree and the membership information. The transient fault 
detectors give fast indication on the occurrence of transient faults. Once a fault is 
detected the system changes state to a safe configuration executing a propagation 
of information with feedback {PIF) procedure |28f I .'Ij . for several times (choosing 
random identifiers for these executions to ensure eventual stabilization). 

The update algorithm informs each processor of the nodes in its connected 
component. The update algorithm stabilizes fast, as it takes O(d) asynchronous 
cycles before reaching a safe configuration. We use d to denote the actual 
diameter of the connected component. Unfortunately, the communication complexity 
of the update algorithm is O(|E| n log n) before and after a safe configuration is 
reached. In this section we present an algorithm that reduces the communication 
complexity to O(|E| log n + n^2 log n) = O(n^2 log n) once a safe configuration is 
reached. 

¹ We note that we do not consider the time required to identify the status of the links 
and neighbors. 
A transient fault detector is composed with the update algorithm to achieve 
the communication adaptability property (see WEI for definitions of transient 
fault detectors). The transient fault detector signals every processor whether or 
not it needs to activate the update algorithm. Our transient fault detector itself 
is obtained by using a new technique for composing transient fault detectors. 

Roughly speaking, whenever a processor detects, by use of the transient fault 
detector, that the update algorithm is not in a safe configuration, the processor 
signals the processors in the system to start the activity of the update. The 
processor stops signaling the other processors to operate the update algorithm 
when it receives an indication that a safe configuration is reached. 

3.1 Self-Stabilizing Update 

We use the self-stabilizing update algorithm of p^l I bj . We now sketch the main 
ideas used by the update algorithm. We start with the data structure used by a 
processor. Each processor has a list of no more than N tuples (id, dis, parent). 
When the update algorithm stabilizes it holds that the list of a processor p_i 
contains n tuples, exactly one tuple (j, dis, k) for each processor p_j that is in the 
same connected component as p_i. The value of dis is the number of edges in a 
shortest path from p_i to p_j, and p_k is a neighbor of p_i that is on a shortest path 
to p_j. Thus, when the algorithm stabilizes every processor knows the identities 
of the other processors in its connected component. 

The processors that execute the update algorithm repeatedly receive all the 
tuples from the tables of their neighbors and use the values received to calculate a 
new table (note that the current table is not used in calculating the new table). 
Every time a processor p_i finishes receiving the tuples of its neighbors it acts as 
follows. Let TU_i be the set of all tuples that p_i reads from its neighbors. 
First, p_i adds 1 to the dis field of every tuple in TU_i. Next, p_i adds the tuple 
(i, 0, nil) to TU_i. If the resulting TU_i contains several tuples with the same id, 
then p_i keeps only one of them, a tuple with the minimal dis value. Finally, p_i 
removes every tuple (id, dis, parent) such that there exists a positive z < dis for 
which no tuple in TU_i has dis = z. The resulting set TU_i is the new table of p_i. 
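
The table computation described above can be sketched as a single function. This is a minimal illustration rather than the algorithm's reference code: the name `update_table` and the dictionary-based input are assumptions, and so is the choice to set the parent field of a received tuple to the neighbor it was read from, a step the text does not spell out.

```python
def update_table(my_id, neighbor_tables):
    """Compute the new table of processor `my_id`.

    neighbor_tables maps a neighbor id to the list of (id, dis, parent)
    tuples last received from that neighbor. The current table is not used.
    """
    best = {}
    for nbr, table in neighbor_tables.items():
        for pid, dis, _parent in table:
            # add 1 to dis; assumed: the neighbor becomes the parent
            candidate = (pid, dis + 1, nbr)
            if pid not in best or candidate[1] < best[pid][1]:
                best[pid] = candidate          # keep a tuple with minimal dis
    best[my_id] = (my_id, 0, None)             # own tuple (i, 0, nil)

    distances = {dis for _pid, dis, _parent in best.values()}
    # drop tuples whose distance is preceded by a gap: some positive z < dis
    # for which no tuple has dis == z
    kept = [t for t in best.values()
            if all(z in distances for z in range(1, t[1]))]
    return sorted(kept, key=lambda t: t[0])


if __name__ == "__main__":
    # hypothetical path 1 - 2 - 3; processor 3 also holds a bogus tuple for id 9
    table_1 = [(1, 0, None), (2, 1, 2), (3, 2, 2)]
    table_3 = [(3, 0, None), (2, 1, 2), (1, 2, 2), (9, 5, 2)]
    print(update_table(2, {1: table_1, 3: table_3}))
```

In the usage example the bogus tuple for identifier 9 is discarded by the gap rule, because no tuple has a distance between 2 and 5.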

3.2 Transient Fault Detectors for Reducing Communication 
Overhead 

The communication complexity of the update algorithm is O(|E| n log n). Note 
that a naive approach for designing a transient fault detector is to repeatedly 
send TUi to every neighbor. A fault will be detected whenever there should be 
a change in the value of TUi (according to the update algorithm) when a mes- 
sage with TUj arrives from a neighboring processor pj . This approach results in 
communication complexity that is identical to the communication complexity of 
the update algorithm. 

In this section we present a fault detector that reduces the communication 
complexity of our algorithm when the algorithm stabilizes (reaches a safe 
configuration). The communication complexity of the algorithm when a fault detector 
is used is O(n^2 log n). 
The update algorithm informs each processor of the nodes in its connected 
component. The task of the transient fault detector is to detect a fault whenever 
there exists at least one processor that does not know the set of processors in 
its connected component. 

We present a new scheme for combining fault detectors that results in low 
communication complexity. In order to reduce the communication complexity, 
we combine two transient fault detectors. The first one communicates short mes- 
sages over all the links of the system and ensures that there is a marked rooted 
spanning tree. The short messages consist of the identifier of the common leader 
and the distance of the processor from this leader. The second transient fault 
detector assumes the existence of a spanning tree and communicates larger mes- 
sages over the links of this tree. In fact, these messages consist of the description 
of the rooted spanning tree. 

Transient Fault Detector for the Existence of a Tree Rooted at a 
Leader: The code for the first part of the transient fault detector appears in 
Figure 1. In the code we use the input (leader_i, dis_i, parent_i), which is defined 
by the output of the update algorithm. Let (l, d, p) be the tuple in TU_i such 
that l is the maximal value among the values of the leader variables in TU_i. The 
value of (leader_i, dis_i, parent_i) is assigned the values of (l, d, p). A change in 
the value of (leader_i, dis_i, parent_i), as well as in the neighbors_i set, triggers 
fault detection. 

Lines 1 and 1a of the code ensure that the information for detection of a 
fault is sent from every processor to its neighbors once every timeout period. 
Line 1b ensures that the processor for which leader_i = i has the value 0 in its 
dis variable and the value nil in its parent_i variable. Line 2a ensures that all 
the processors have the same value in their leader variable and that the distance of 
the parent of each (non-leader) processor p_i is one less than the distance of p_i 
from the leader. 



Input: (leader_i, dis_i, parent_i) (* updated by lower level *) 

1. Upon timeout: 
   (a) for each j ∈ neighbors_i send (leader_i, dis_i, parent_i). 
   (b) if leader_i = i and (dis_i ≠ 0 or parent_i ≠ nil) then fault is detected. 

2. Upon receiving (l, d, p) from p_j: 
   (a) if (l ≠ leader_i) or ((j = parent_i) and (dis_i ≠ d + 1)) then fault is detected. 

Fig. 1. Transient Fault Detector of p_i, for the Existence of a Tree Rooted at a 
Leader. 
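
The two checks of this detector translate directly into predicates. The sketch below is illustrative only: the `TreeState` record and the function names are assumptions, and the timeout-driven sending of line 1(a) is left out so that only the local tests of lines 1(b) and 2(a) are shown.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TreeState:
    """Local input (leader_i, dis_i, parent_i) of processor my_id."""
    my_id: int
    leader: int
    dis: int
    parent: Optional[int]      # None plays the role of nil


def fault_on_timeout(s: TreeState) -> bool:
    """Line 1(b): a leader must have distance 0 and no parent."""
    return s.leader == s.my_id and (s.dis != 0 or s.parent is not None)


def fault_on_receive(s: TreeState, sender: int,
                     msg: Tuple[int, int, Optional[int]]) -> bool:
    """Line 2(a): leaders must agree, and dis_i must equal the parent's d + 1."""
    l, d, _p = msg
    return l != s.leader or (sender == s.parent and s.dis != d + 1)


if __name__ == "__main__":
    # hypothetical path 3 - 2 - 1 with leader 3 (the maximal identifier)
    p2 = TreeState(my_id=2, leader=3, dis=1, parent=3)
    p1 = TreeState(my_id=1, leader=3, dis=2, parent=2)
    print(fault_on_receive(p1, 2, (p2.leader, p2.dis, p2.parent)))   # consistent state
    p1.dis = 5                                                       # transient fault
    print(fault_on_receive(p1, 2, (p2.leader, p2.dis, p2.parent)))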

Define the directed graph T = (V, E) as follows: each node of the graph V 
represents a processor in the system (and vice versa). There exists a directed 
edge (i, j) ∈ E if and only if the value of the parent field of the processor p_i, in 
the tuple of TU_i with the maximal id, is p_j. 
Definition 1. A directed graph is an in-tree if the undirected underlying graph 
is a tree and if every edge of the tree is directed towards a common root. 

To prove the correctness we show that if no processor detects faults during 
an asynchronous cycle, then T is an in-tree rooted at the common leader (the 
processor with the maximal identifier). 

The Tree Update Algorithm: Before we continue with the transient fault 
detector, let us add a mechanism to distribute the description of T to every 
processor in the system. We augment each processor p_i with a variable T_i that 
should contain the description of T. 

Let T_i(p_j) be the component of T_i that is connected to p_i when the link 
from p_i to p_j is removed from T_i. p_i repeatedly sends T_i(p_j) to every 
processor p_j ∈ ({parent_i} ∪ children_i). parent_i is defined by the value p_j of the 
tuple (l, d, p_j) in TU_i such that l = leader_i. The children_i set includes every 
neighbor p_j from which the last table TU_j received includes a tuple (l, d, p_i) 
where l = leader_i. p_i repeatedly computes T_i using the last values of T_j(p_i) 
received from every processor p_j ∈ ({parent_i} ∪ children_i). p_i constructs T_i 
from the above T_j(p_i) by adding the links connecting itself to the processors in 
({parent_i} ∪ children_i). 
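
The exchange of partial tree descriptions can be sketched as two functions. This is an illustration under stated assumptions rather than the paper's code: a tree description is represented as a set of (child, parent) edges, and the names `component_without_link` and `rebuild_tree` are hypothetical.

```python
def component_without_link(edges, i, j):
    """T_i(p_j): edges of the part still connected to i once link {i, j} is cut."""
    remaining = {e for e in edges if set(e) != {i, j}}
    reached, frontier = {i}, [i]
    while frontier:
        u = frontier.pop()
        for a, b in remaining:
            for x, y in ((a, b), (b, a)):      # treat edges as undirected here
                if x == u and y not in reached:
                    reached.add(y)
                    frontier.append(y)
    return {(a, b) for (a, b) in remaining if a in reached}


def rebuild_tree(i, parent, children, received):
    """Compute T_i from the last T_j(p_i) received from each tree neighbor j.

    received maps a neighbor id to the set of edges it sent; a missing entry
    is treated as an empty description.
    """
    tree = set()
    tree_neighbors = ([parent] if parent is not None else []) + sorted(children)
    for j in tree_neighbors:
        tree |= received.get(j, set())
    if parent is not None:
        tree.add((i, parent))                  # link to the parent
    tree |= {(c, i) for c in children}         # links to the children
    return tree


if __name__ == "__main__":
    # hypothetical in-tree rooted at 3: 1 -> 2 -> 3 and 4 -> 3
    T = {(1, 2), (2, 3), (4, 3)}
    print(component_without_link(T, 2, 3))     # what p_2 would send to p_3
    print(rebuild_tree(3, None, {2, 4}, {2: {(1, 2)}, 4: set()}))
```

Representing the description as an edge set keeps the merge step a plain set union; other encodings of T_i would work equally well.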



Input: T_i, parent_i, children_i (* updated by lower level *) 

1. consistent ← true 

2. if T_i does not encode a spanning in-tree then consistent ← false 

3. if children_i is different from the set of processors that are the 
   children of p_i in T then consistent ← false 

4. if parent_i = nil and p_i is not the leader of T then consistent ← false 

5. if parent_i ≠ nil and parent_i is not the parent of p_i in T 
   then consistent ← false 

6. return consistent 

Fig. 2. Consistency Test Function. 
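
The consistency test of Fig. 2 can be sketched as follows. The representation is an assumption made for illustration: T_i is a set of (child, parent) edges, the leader of T is taken to be the root of that in-tree, and "spanning" is checked only over the nodes that T_i itself mentions plus p_i. The function names are hypothetical.

```python
def root_of_in_tree(tree, me):
    """Return the root if `tree` encodes an in-tree containing `me`, else None."""
    parent_of = {}
    for child, par in tree:
        if child in parent_of:                 # a node with two outgoing edges
            return None
        parent_of[child] = par
    nodes = {me} | {x for edge in tree for x in edge}
    roots = [v for v in nodes if v not in parent_of]
    if len(roots) != 1:                        # no root, or a forest
        return None
    for v in nodes:                            # every node must reach the root
        seen = set()
        while v in parent_of:
            if v in seen:                      # a cycle never reaches the root
                return None
            seen.add(v)
            v = parent_of[v]
    return roots[0]


def consistency_test(me, tree, parent, children):
    """Steps 1-6 of the consistency test, with None standing for nil."""
    root = root_of_in_tree(tree, me)
    if root is None:                                           # step 2
        return False
    if set(children) != {c for c, p in tree if p == me}:       # step 3
        return False
    if parent is None and root != me:                          # step 4
        return False
    if parent is not None and (me, parent) not in tree:        # step 5
        return False
    return True                                                # step 6


if __name__ == "__main__":
    T = {(1, 2), (4, 2), (2, 3)}               # hypothetical in-tree rooted at 3
    print(consistency_test(2, T, 3, {1, 4}))
    print(consistency_test(2, T, 3, {1}))
```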

We now prove the correctness of the tree update algorithm. In the proof we 
consider an execution that starts in a safe configuration of the update algorithm 
and prove correctness of the tree update in such executions. A safe configuration 
of the update algorithm is a configuration in which the values of the tuples of all 
the processors are correct (and therefore are not changed in any execution that 
starts in such a safe configuration) . 

In the lemma we use the term height of a processor pi in an in-tree for the 
maximal number of edges in a path from a leaf in the tree to pi, such that the 
path does not include the root of the tree. 

Lemma 1. Consider any execution R of the tree update algorithm that starts in
a safe configuration of the update algorithm and consists of at least l + 1
asynchronous cycles. Let pj be a processor such that T(pj) is a sub-in-tree of T
(the in-tree defined by the update algorithm) that is rooted at pj, and the
height of T(pj) is at most l. Let Tj(pj) be the description of the tree rooted
at pj in the variable Tj of pj. It holds that T(pj) = Tj(pj) in the last
configuration of R.

A configuration, c, is safe with relation to the tree update algorithm iff c is
safe for the update algorithm and for every processor pi, Ti = T. Moreover, in
any execution that starts in c the value of Ti is not changed (this last
requirement implies, in fact, that any message in transit from pi to pj contains
Ti(pj), that is, the portion of T connected to pi when the link from pi to pj
is removed).

Corollary 1. The tree update algorithm reaches a safe configuration following
the first $O(d)$ asynchronous cycles and its communication complexity is
$O(n^2 \log N)$.

Transient Fault Detector for Correct Description of the Tree: The second
transient fault detector assumes the existence of a rooted spanning tree T
that is defined by the child parent relation and ensures that every processor pi
has the description of T in Ti. Thus, it ensures that every processor knows the
set of processors in its connected component.

Let us first describe the consistency test function in Figure 2 that is used by
our transient fault detector. In the code we use the input Ti, parenti, childreni
which is defined by the output of the tree update algorithm. The consistency
test function uses a boolean variable consistent. First pi assigns true to the
consistent variable (line 1 of Figure 2). In line 2, pi checks that Ti is a
spanning in-tree, that is, a directed tree in which every edge is directed
towards a common root. Lines 3, 4 and 5 test that the child parent relations of
pi (according to the update algorithm) are correct in Ti. The function returns
the final value of consistent.

The transient fault detector is presented in Figure 3. The fault detector will
ensure that all local values of T are identical and that every processor's local
tree neighborhood appears in T. In the code we use the input parenti, childreni
which is defined by the output of the tree update algorithm (see the description
of the code of Figure 2 for the values of the above inputs).

pi repeatedly executes lines 1a and 1b. In line 1a pi sends Ti to its parent
and children. In line 1b pi checks the consistency of Ti according to the
consistency test described in Figure 2 and detects a fault accordingly. Whenever
pi receives Tj from pj, pi checks whether Ti = Tj and detects a fault if this
equation is not true (line 2a of the code).



Input: Ti, parenti, childreni (* updated by lower level *)

1. Upon timeout:
   (a) for each pj ∈ {parenti} ∪ childreni send Ti to pj.
   (b) if Ti is inconsistent then detected a fault.

2. Upon receiving Tj from pj
   (a) if Ti ≠ Tj then detected a fault.



Fig. 3. Transient Fault Detector of pi, for Correct Description of the Tree. 

We prove the correctness of the second fault detector assuming that no fault
is detected by the first transient fault detector.

Lastly, we combine both the fault detectors; the first fault detector messages
are augmented with the second fault detector messages. (Note that the second
fault detector sends messages only on tree links. The messages sent by the first
fault detector on non-tree links are not augmented by a message of the second
fault detector). We conclude the presentation and correctness proof of the
transient fault detectors by the following corollary.

Corollary 2. (1) The combined fault detector detects a fault during a single
asynchronous cycle whenever there exists a processor pi such that Ti does not
consist of the processors in pi's connected component, and (2) the communication
complexity of the combined fault detector is $O(n^2 \log N)$.

3.3 Lower Bound on the Communication Complexity 

We now present a lower bound of $\Omega(n^2 \log(N/n - 1))$ bits on the
communication complexity. The lower bound is for any fault detector that detects
a fault within a single asynchronous cycle (whenever a processor has inconsistent
knowledge of the set of processors in its connected component or view). Recall
that a group membership service notifies the application of the current view. To
do so we consider an asynchronous cycle that starts with all processors sending
to every one of their neighbors (where a processor can send nil messages in case
no message should be sent to a neighbor), and the cycle terminates after all
messages sent are received. We examine processors $p_1, p_2, \ldots, p_n$ that
are connected by a chain communication graph. Assume that n is even (a similar
argument can be used for a chain with an odd number of processors).

Let $m_{k,k+1}$ ($m_{k+1,k}$) be the message sent from $p_k$ to $p_{k+1}$ (from
$p_{k+1}$ to $p_k$, respectively). We claim that distinguishing the distinct
combinations of $m_{k,k+1}$, $m_{k+1,k}$ requires at least
$\Omega(n \log(N/n - 1))$ bits.

Let $p_k$ be a processor in the chain and suppose that $k \le n/2$. Fix a set of
$k$ distinct identifiers for the processors $p_1, p_2, \ldots, p_k$. We prove a
lower bound by using the number of possible choices of different sets of $n - k$
distinct identifiers for the rest of the processors
$p_{k+1}, p_{k+2}, \ldots, p_n$.

Let $X_1$ and $X_2$ be two such choices. Now we describe two different systems
that differ in the way we assign identifiers to processors
$p_{k+1}, p_{k+2}, \ldots, p_n$. The identifiers of the processors
$p_{k+1}, p_{k+2}, \ldots, p_n$ in the first (second) system are the identifiers
in $X_1$ ($X_2$, respectively). Clearly the communication over the edge
connecting $p_k$ to $p_{k+1}$ must not be the same in the two systems above.
Otherwise we may replace the two different portions of the two systems and no
fault will be detected, while $p_1$ is not aware of the different set of
processors in the system. The case of $k > n/2$ is handled analogously, fixing a
set of $k$ distinct identifiers for the processors
$p_{n-k}, p_{n-k+1}, \ldots, p_n$. In both cases we conclude that the number of
communication patterns needed is at least the number of choices of $n - k$
distinct identifiers for the processors $p_{k+1}, p_{k+2}, \ldots, p_n$, out of
$N - k$ identifiers:

$$\frac{(N-k)!}{(n-k)!\,((N-k)-(n-k))!} = \frac{(N-k)!}{(n-k)!\,(N-n)!} = \frac{(N-n+1)\cdots(N-k)}{(n-k)!} \ge \left(\frac{N-n+1}{n-k}\right)^{n-k}.$$

We assume that $N > 2n$, thus we have that $(N-n+1)/(n-k) > 1$. By the
assumption that $1 \le k \le n/2$, we have that
$((N-n+1)/(n-k))^{n-k} \ge ((N-n)/n)^{n/2}$. Therefore, for the communication
between $m_{k,k+1}$, $m_{k+1,k}$, at least $\Omega(n \log(N/n - 1))$ bits are
needed. The communication complexity is a measure that considers all the links
and therefore is $\Omega(n^2 \log(N/n - 1))$ bits.
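The chain of inequalities in the counting argument can be checked numerically. The sketch below is only a sanity check under the stated assumptions (N > 2n and 1 ≤ k ≤ n/2); the function names are illustrative and not part of the paper.

```python
from math import comb, log2

def bound_chain(N, n, k):
    """Return the three quantities compared in the lower-bound argument."""
    # Number of ways to choose n-k identifiers out of N-k.
    choices = comb(N - k, n - k)
    # Intermediate bound ((N-n+1)/(n-k))^(n-k).
    middle = ((N - n + 1) / (n - k)) ** (n - k)
    # Final bound ((N-n)/n)^(n/2).
    lower = ((N - n) / n) ** (n / 2)
    return choices, middle, lower

def min_bits(N, n):
    """log2 of the final bound: (n/2) * log2(N/n - 1) bits per link."""
    return (n / 2) * log2(N / n - 1)
```

For N = 100 and n = 10 the final bound is 9^5 patterns, so roughly 15.8 bits are needed on the link under this estimate.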

3.4 Group Membership and Voluntarily Join/Leave 

In a legal execution, only the user is privileged to change his or her membership
status in a group. Such a change occurs in response to the application requests.
Here we describe how, in a legal execution, processor pi may join (leave) a group
g by locally setting (resetting, respectively) memberi.

We use the self-stabilizing β-synchronizer algorithm to coordinate view
updates. The β-synchronizer is designed to be executed on a spanning tree of the
system, in our case T. There are two alternating phases for the β-synchronizer,
a propagation phase and a convergecast phase. In a legal execution, the root of
T is responsible for the membership updates. During the propagation phase, the
root propagates the view it maintains. As this view propagates through T, every
processor pi assigns it to a local variable vi that maintains its view. The
value of memberi of every processor pi is accumulated during the convergecast
phase. The value of memberi of a leaf pi in T is delivered to pk, the parent of
pi. A parent pk of a leaf processor concatenates the values memberc received
from its children pc together with memberk and delivers the result to its
parent, and so on. Once the convergecast phase terminates, the root sends the
received concatenated information on the membership of all the processors,
together with a view identifier (the view identifier is changed whenever the set
of members is changed).
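The convergecast of the member values can be sketched as follows. This is a sequential illustration, not the β-synchronizer itself: the tree is assumed to be a dictionary from a processor to its children, and the view identifier is assumed to be a counter, which the paper does not specify.

```python
def convergecast(children, member, p):
    """Gather {processor: member value} from the subtree rooted at p."""
    gathered = {p: member[p]}
    for c in children.get(p, ()):
        # A parent concatenates what each child delivers with its own value.
        gathered.update(convergecast(children, member, c))
    return gathered

def next_view(view_id, old_members, children, member, root):
    """Root-side step: build the new view, changing the id only on a change."""
    info = convergecast(children, member, root)
    members = frozenset(q for q, is_member in info.items() if is_member)
    if members != old_members:
        view_id += 1  # assumed encoding of "the view identifier is changed"
    return view_id, members
```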

A transient fault detector monitors the consistency of the join/leave and 
membership information. Details are omitted from this extended abstract. 

Before we turn to describe the actions taken upon a fault detection let us
note that a randomized transient failure detector can be used as well. In a legal
execution, our deterministic transient failure detector repeatedly sends the same 
message through each link. Thus, the randomized technique proposed in 
that uses a logarithmic size of the repeatedly sent message can be used here to 
further reduce the size of the messages sent. In such a case the failure detectors 
will detect a fault with high probability. 

3.5 Fast Convergence 

So far we have discussed transient fault detection, without describing the action
taken when a fault is detected. The goal of the technique presented here is to
ensure a fast convergence at the cost of a higher communication complexity.
Once the transient fault detector detects a fault, we would like to activate the
self-stabilizing tree update algorithm to regain consistency as soon as possible
and then switch back to using the transient fault detector.

Propagation of a Fault Detection: Once a fault is detected by a processor
pi, the processor propagates the fault indication to every other processor. Every
tuple of the update tables is extended to include a state field, where the domain
of the state is {safe, dtc, act}. We use the term the source tuple of pi for the
single tuple of TUi with i in the id field. pi starts the propagation by assigning
the values (i, 0, nil, dtc) in its source tuple. In the sequel we use the fourth
field of pi's source tuple as the state of pi.

Every processor pj that has at least one state field in a tuple of TUj with a
value not equal to safe executes the update algorithm, sending messages through
every attached link. When pi sends the new value of TUi to pj and the state
of pj is safe, pj changes its state to dtc. The information on the fact that pi
detected a fault propagates to the entire system in the same way.

Our goal is to ensure that every processor pk verifies that the tuples in the
tables of the processors encode a fixed BFS tree rooted at pk, and therefore the
update algorithm is in a safe configuration. Then we allow the system to switch
back to using the transient fault detector.

A central tool in achieving an indication on the completion of the
reconstruction of the BFS trees is the PIF procedure. The propagation is done by
flooding the system with the new information in the way we described above 
(for the case of dtc). The propagating processor, pi, should receive a feedback 
on the completion of the propagation before finalizing the PIF procedure. The 
feedback is sent to a processor with a smaller distance from pi, which pi selects 
to be its parent in the tree. Every processor pj uses the distance variable of the 
tuple with id = i in TUj as its (upper bound on the) distance to pi. 

A processor pj sends a feedback only when the maximal difference between the
distance of pj to pi and the distance of any neighbor pk of pj to pi is 1. The
fact that the value in the distance fields is an upper bound on the distance from
pi guarantees that every neighbor pj of pi sends feedback when the value of its
distance field is 1 and therefore has a fixed parent (namely, pi). Moreover, pj
sends a feedback only when each of its neighbors has a distance of at most 2.
Thus, processors of distance 2 have the correct distance and therefore a fixed
parent. Similar arguments hold for processors of greater distances, concluding
that a fixed BFS tree rooted at
Pi exists when pi receives the feedback. More details can be found in We 
note that part of the new information that is propagated is a randomly chosen
color that identifies (with high probability) the current PIF execution initiated
by pi as a new PIF execution.

The fast convergence algorithm should ensure stabilization from an arbitrary
state. We trace the activity of the system from the first fault detection. We would
like the fault detection to ensure that every processor will start a PIF following
the fault detection. Then, when every processor completes the PIF and verifies
that its tree is a fixed BFS tree, we can stop executing the
communication-expensive tree update algorithm.

When pi detects a fault, it starts a PIF that causes every processor pj either
(1) to change its state from safe to dtc and start a PIF or (2) when pj is in the
state act, to execute at least one more complete PIF before changing state to
safe.

The update algorithm is executed by pi whenever there exists a tuple in TUi
with a state field not equal to safe. Otherwise, pi responds to any TUj message
(sent by a neighbor pj) by recomputing TUi accordingly, and sending TUi to
pj exclusively. (Note that the transient fault detector is disabled whenever there
exists a tuple in TUi with a state field not equal to safe).

We may conclude that once a transient fault is detected and propagated to
the entire system it holds that (1) no processor is in a safe state, and (2) no
transient fault is detected.

Upon Completing the Propagation of a Fault Detection: A processor pi 
that has completed propagating the fault indication (completing a single PIF) 
changes state to act. Then pi waits for all other processors to complete their 
propagation of fault detection, reaching a system state in which no processor 
(uses the failure detector to detect a fault and) starts propagating an indication 
of a failure. In other words, when pi is in act state pi repeatedly executes PIF 
until it receives an indication that no dtc tuple appears in any table. 

The indication for the absence of dtc tuples is collected using a PIF query.
The PIF procedure is used to query the values of the state fields using the
following procedure: Every tuple of the update tables is extended to include a
nodtc bit field. When pk chooses a new color, pk sets the nodtc bit to true, and
starts a PIF. A processor pj sets the nodtc bit of every tuple in its table to
false whenever there exists a tuple with the state dtc in TUj. Whenever pj sends
feedback to its parent (as part of the PIF), pj also sends the AND of the nodtc
bit values of its children's tables and its own table. Thus, a single
nodtc = false results in a nodtc = false feedback that arrives at pk.
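The AND aggregation of the nodtc bit can be illustrated by a short recursive sketch. It assumes a fixed tree given as a dictionary from a processor to its children and a dictionary giving, for each processor, the set of state values appearing in its table; both encodings and the function name are illustrative assumptions.

```python
def nodtc_feedback(children, table_states, p):
    """Feedback bit that p sends to its parent in a nodtc PIF query."""
    # p's own bit is false as soon as its table holds a tuple in state dtc.
    own = "dtc" not in table_states[p]
    # The feedback is the AND of p's bit and the bits fed back by its children.
    return own and all(
        nodtc_feedback(children, table_states, c) for c in children.get(p, ())
    )
```

A single dtc tuple anywhere in the tree therefore turns the value that reaches the initiator into false.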

We may conclude that once the nodtc PIF query procedure is completed 
with nodtc = true, then no processor is in a dtc state (and the transient fault 
detectors are disabled). Furthermore, let pk be the first processor that changes 
its state from act to safe, after processor pi had notified a fault detection. Let 
c be the configuration that immediately follows this state change of pk. We will 
prove that, (1) the tree rooted at each processor in c is a fixed BFS tree, (2) 
the state field of every tuple in every table in c is act and, (3) no transient fault 
is detected. 

Returning to Normal Operation: Once all the processors are in act state
the system is ready to return to normal operation. A processor pi changes state
to a safe state when pi is in act state and finds out that no dtc state exists in
the system. Still, pi does not activate the transient failure detector until all
processors change state to a safe state. pi repeatedly executes PIF queries until
it finds that the state of all the processors is safe. Thus, when a processor
returns to using the transient failure detector all the processors are in a safe
state and therefore a fault detection will result in a global state change to
dtc, then to act and at last to safe after reaching a safe configuration.

The PIF query initiated by a processor in a safe state uses an additional
allsafe bit field. When pk chooses a new color, pk sets both the nodtc and the
allsafe bits to true, and starts the PIF procedure. Recall that a processor pj
sets the nodtc bit to false whenever there exists a tuple with a dtc state in
TUj. In addition, pj sets the allsafe bit to false whenever there exists a tuple
with a state not equal to safe in TUj. pk changes its state to dtc whenever there
is a dtc tuple in TUk, or a feedback with nodtc = false arrives. If the allsafe
bit carried by the feedback is true, then pk stops executing the update
algorithm, and starts using the transient fault detector. If the allsafe bit is
false (and the nodtc bit is true) then pk assigns true to both nodtc and allsafe
bits, and repeats executing the PIF query.

We note that the tree description used by the transient fault detector should
be identical in all the processors before switching back to normal operation.
Thus, the allsafe bit is also used to indicate that the tree description of a
processor and its neighbors are identical (otherwise the allsafe bit that arrives
in the feedback is false).
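The decision pk takes when the feedback of an allsafe query arrives can be summarized in a few lines. The function name and the returned labels are hypothetical; only the three outcomes come from the description above.

```python
def on_query_feedback(has_local_dtc, nodtc, allsafe):
    """Action of a processor in a safe state once a PIF query feedback arrives."""
    if has_local_dtc or not nodtc:
        # A dtc tuple exists locally or somewhere in the system.
        return "change_state_to_dtc"
    if allsafe:
        # Every state is safe and the tree descriptions agree.
        return "use_transient_fault_detector"
    # No dtc, but some processor is not yet safe: reset the bits and query again.
    return "repeat_query"
```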

We may conclude that when the feedback of the allsafe PIF query is true, it 
holds that all the processors are in a safe state. Furthermore, let pk be the first 
processor that returns to use the transient fault detector, after pi propagated a 
fault detection. Let c be the configuration in which pk returns to use the tran- 
sient fault detector. Then in c it holds that the system is in a safe configuration 
with relation to the update algorithm. 

We now turn to a detailed presentation of the fast convergence algorithm.
The code of the fast convergence algorithm appears in Figure 4. In the code,
we use the PIF and the PIF query procedures. A formal description of the PIF 
procedure can be found in M- The PIF procedure is extended to PIF queries 
(nodtc and allsafe queries) as described above.

Lines 1, 2 and 3 of the code describe the actions pi takes according to its
state. When pi is in a dtc state (line 1), pi executes a PIF (line 1a); once the
PIF is completed pi changes its state to act (line 1b). When pi is in act state
(line 2), pi repeatedly executes a PIF query to ensure that no dtc tuple exists
in the system (line 2a). Then, pi changes its state to safe (line 2b). In a safe
state (line 3), pi repeatedly executes a PIF query to ensure that all the states
(of the processors and the state fields of the tuples) are safe states (line 3a).
If there exists a dtc tuple, then pi changes its state to dtc (line 3b). If indeed
there are only safe tuples in the system, then pi returns to using the transient
failure detector (line 3c). Once the failure detector is operating, pi changes its
state to dtc when a fault is detected (lines 3c and 3d).



1. state = dtc → (* Notify *)
   (a) Execute PIF
   (b) state ← act

2. state = act → (* Finish Notification *)
   (a) Execute PIF query
       until no dtc in the system
   (b) state ← safe

3. state = safe → (* Back to TFD *)
   (a) Execute PIF query
       until all safe or exists dtc
   (b) if PIF query results dtc then
       state ← dtc
   (c) else execute transient failure
       detector until fault detection
   (d) state ← dtc



Fig. 4. Fast convergence algorithm of pi. 
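The state changes of Figure 4 can be read as a small transition table. The sketch below abstracts the PIF executions into named events; the event names are hypothetical labels introduced only for this illustration.

```python
# (current state, event) -> next state, following lines 1b, 2b, 3b and 3d.
TRANSITIONS = {
    ("dtc", "pif_completed"): "act",
    ("act", "query_no_dtc"): "safe",
    ("safe", "query_found_dtc"): "dtc",
    ("safe", "fault_detected"): "dtc",
}

def step(state, event):
    """Return the next state; events that do not apply leave the state as is."""
    return TRANSITIONS.get((state, event), state)

def run(state, events):
    """Apply a sequence of events and return the visited states."""
    trace = [state]
    for event in events:
        state = step(state, event)
        trace.append(state)
    return trace
```

For example, a processor that detects a fault while safe passes through dtc and act before it becomes safe again.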
4 Concluding Remarks

This paper presents the first asynchronous self-stabilizing group membership
service. We believe that the new ideas presented in this paper will enrich the
set of techniques used in the design of robust group communication services. For
example, we do not utilize the idea of token passing for detecting a crash.
Instead we present a self-stabilizing scheme that detects a fault fast (in a
single asynchronous cycle) and is still communication efficient. Our membership
service can serve as the base for additional group communication services.

Abstract. In an asynchronous system, where processes can crash, perfect
predicate detection for general predicates is difficult to achieve. A
general predicate thereby is of the form α ∧ β, where α and β refer to
a normal process variable and to the operational state of that process,
respectively. Indeed, the accuracy of predicate detection largely depends
on the quality of failure detection. In this paper, we investigate the
predicate detection semantics that are achievable for general predicates
using either failure detector classes □◇P, ◇P, or P. For this purpose,
we introduce weaker variants of the predicate detection problem, which
we call stabilizing and infinitely often accurate. We show that perfect
predicate detection is impossible using the aforementioned failure
detectors. Rather, ◇P and P only allow stabilizing predicate detection.
Consequently, we explore alternative approaches to perfect predicate
detection: introducing a stronger failure detector, called ordered perfect, or
restricting the general nature of predicates.



1 Introduction 

Testing and monitoring distributed programs involves the basic task of detecting 
whether a predicate holds during the execution of the system. For example, a 
software engineer might want to detect the predicate “variable x has changed to 
value 2” to find out at what point in the execution x takes on a bad value.
Predicate detection in distributed settings is a well-understood problem and many
techniques together with their detection semantics have been proposed [7].
However, most of the techniques have been proposed under the assumption that no
faults occur in the system. Hence, most of the methods proposed in the litera- 
ture are not robust in the sense that they offer no guarantees if faults such as 
message losses or process crashes occur in the system. 

In an asynchronous system where processes can crash, a general predicate
detection mechanism should also detect these crash events. Chandra and Toueg [5]
proposed to encapsulate the functionality of failure detection into a separate
module and specify it using axiomatic properties. Such a failure detector can be
used to locally maintain information about the operational state of the processes.
Based on the quality of failure detection, different classes of failure detectors
can be defined. Most relevant to this paper are the classes of perfect, eventually
perfect, and infinitely often accurate failure detectors (denoted P, ◇P, and
□◇P, respectively).

Standard predicate detection techniques aim at monitoring conditions which
are composed of predicates on the local state space of processes [7]. For
instance, a predicate xi = 1 ∧ yi = 2 evaluates on the variables x and y in the
local state of process pi. In this paper, we consider the detection of predicates
that are boolean combinations of predicates on the local state and predicates on
the operational state of a process. This allows us to evaluate predicates of the
form xi = 1 ∧ crashedi, where crashedi is a predicate that is true iff (if and
only if) application process pi has crashed. Ideally, a predicate detection
algorithm never erroneously detects such a predicate and does not miss any
occurrence of the predicate in the underlying computation. However, the quality
of the underlying failure detector severely limits the quality of predicate
detection. We show that perfect predicate detection is generally impossible with
failure detectors of type □◇P and ◇P. Rather surprisingly, the impossibility
still holds for P. We investigate weaker variants of predicate detection which we
call stabilizing and infinitely often accurate. Briefly, a predicate detection
algorithm is stabilizing if it eventually stops making false detections, and it
is infinitely often accurate if it has infinitely many phases where it does not
issue false detections. We also investigate two conditions under which perfect
predicate detection is solvable. The first is the existence of a novel type of
failure detector which we call ordered perfect and which is strictly stronger
than P. The second condition imposes restrictions on the generality of
predicates.

Apart from clarifying the relation between predicate detection and failure
detection, this work wishes to stress the connection between “stabilizing
failure detectors” and self-stabilization (which has only partly been done by
other authors) and, hence, argue that self-stabilization is a concept of
eminent practical importance. Furthermore, our results manifest a drawback of
the approach that uses abstract failure detectors to solve problems in distributed
computing, namely, that for every problem it is necessary to adapt the failure
detector properties, a highly non-trivial task.

Related Work. While predicate detection in fault-free environments has been 
intensely studied |3 , solving the task in faulty environments is not yet very well 
understood. To our understanding, Shah and Toueg were the first to
investigate this by adapting the snapshot algorithm of Chandy and Lamport
with a simple timeout mechanism. Chandra and Toueg [5] later argued to
define the functionality of failure detection in an abstract way and proposed a
rich set of failure detector classes. However, these classes were meant to help
solve the consensus problem and not the problem of predicate detection. Garg
and Mitchell investigate the predicate detection problem again and define
an infinitely often accurate failure detector, i.e., a failure detector which is
implementable in asynchronous systems, but they restrict the scope of the
predicates to set-decreasing predicates. A predicate is set-decreasing if,
whenever it holds for a set of processes, it also holds for a subset of these
processes. To our knowledge, our work is the first to investigate the
relationship between predicate detection and failure detection in the general
case.

While it is not clear whether Garg and Mitchell or Shah and Toueg
consider predicates which contain references to the operational state of
processes, Gärtner and Kloppenburg [12] explicitly allow these types of
predicates but restrict themselves to environments where only infinitely often
accurate failure detectors are available. Other authors have investigated the use
of perfect failure detectors to detect special predicates, e.g., distributed
deadlocks or distributed termination.

The observation that perfect failure detectors do not allow solving all
problems which are solvable in synchronous systems has been previously made by
Charron-Bost, Guerraoui, and Schiper by exhibiting a problem that is solvable
in synchronous systems but is not solvable in asynchronous systems augmented
with perfect failure detectors (the strongly dependent decision problem).
While Charron-Bost et al. argue that this result has practical consequences
with respect to the efficiency of atomic commitment, our result shows that there
exists a practically relevant problem, namely predicate detection, which suffers
from the deficiencies of perfect failure detectors.

Paper Organization. After introducing the system assumptions in Section 2, we
define three different semantics for the predicate detection problem in
Section 3. In Section 4 we consider systems where crash failures occur, augmented
with infinitely often accurate (Section 4.2), with eventually perfect
(Section 4.3), and perfect failure detectors (Section 4.4). Our focus is on a
system with one application process and one observer. We then generalize to
multiple application processes and observers, and finally conclude the paper. For
lack of space, we mostly give only proof sketches and relegate the full proofs to
the extended version of the paper.

2 System Assumptions 

2.1 System Model 

We consider a system with n application processes p1, ..., pn and m monitor
processes b1, ..., bm (i.e., observers) whose task is to monitor the execution of
the application processes. Processes communicate by message passing via FIFO
channels in a fully connected network. Communication is reliable, i.e., no
messages are lost, duplicated, or altered. Our system is asynchronous, i.e., no
bounds on communication delays or on relative processor speeds exist. Application
processes can fail by crashing. Once a process has crashed, it does not recover
during the execution. A process which crashes is called faulty. A non-faulty
process is called correct. For simplicity, we assume that monitor processes
do not fail. If it is clear from the context we will refer to application processes
simply as “processes” and to monitor processes as “monitors”.

Every application process pi has a local state si consisting of an assignment
of values to all of its variables. A global state G = {s1, ..., sn} is a set
containing exactly one local state si from every process pi. State changes are
assumed to be atomic events local to some process such that an execution of the
application processes can be modeled as a sequence of global states G1, G2, ...,
where Gi+1 results from Gi by executing a local event on some process. Global
state G1 denotes the initial global state of the system.

2.2 Failure Detectors 

Each monitor process has access to a local failure detector module that
provides (possibly incorrect) information about failures that occur on application
processes. A process pi is suspected by monitor bj if the failure detector
module associated with bj detects the failure of pi. Failure detectors are defined
in terms of a completeness and an accuracy property. The completeness property
requires that a failure detector eventually suspects processes that have crashed,
while the accuracy property limits the number of mistakes a failure detector can
make. We recall the definitions of accuracy and completeness which are relevant
in this paper [5]:

— (strong completeness) Eventually every application process that crashes is 
permanently suspected by every monitor process. 

— (strong accuracy) No application process is suspected before it crashes. 

— (eventual strong accuracy) There is a time after which correct application 
processes are not suspected by any monitor. 

The failure detectors we consider in this paper all satisfy strong completeness. A 
perfect failure detector additionally satisfies strong accuracy. An eventually per- 
fect failure detector satisfies eventual strong accuracy instead of strong accuracy. 
Garg and Mitchell [11] have introduced an additional accuracy property: 

— (infinitely often accuracy) Correct application processes are not permanently
suspected by any monitor.

An infinitely often accurate failure detector satisfies strong completeness and 
infinitely often accuracy. 
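The accuracy properties can be made concrete on a finite prefix of a failure detector history. The sketch below is an illustration only: it assumes a single monitor, discrete time steps, a list `suspects` whose entry t is the set of processes suspected at step t, and a dictionary `crash_time` mapping each process to its crash step or to None. The eventual properties speak about infinite executions, so a finite prefix can only suggest them.

```python
def strong_accuracy(suspects, crash_time):
    """True iff, within the prefix, no process is suspected before it crashes."""
    return all(
        crash_time.get(p) is not None and crash_time[p] <= t
        for t, suspected in enumerate(suspects)
        for p in suspected
    )

def last_false_suspicion(suspects, crash_time):
    """Last step at which a not-yet-crashed process is suspected, or None."""
    last = None
    for t, suspected in enumerate(suspects):
        if any(crash_time.get(p) is None or crash_time[p] > t for p in suspected):
            last = t
    return last
```

A history that behaves like an eventually perfect detector shows a finite value of `last_false_suspicion` after which the suspected sets contain only crashed processes.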

Chandra and Toueg [5] assume that failure detectors are passive modules.
A monitor queries the module to learn about the operational state of other
processes. In this setting, an infinitely often accurate failure detector offers no
“real” accuracy at all. This is because a monitor may always query the failure
detector when it is not accurate, i.e., when it erroneously suspects the process.
Garg and Mitchell therefore assume that the complete history of a failure
detector is available so that an application does not miss the change of a failure
detector output. This is equivalent to a model where the failure detector
asynchronously (via interrupts) notifies the application about any change in its
state. This model clearly makes more assumptions than the query model of Chandra
and Toueg [5]. However, we now argue that these two models are equivalent
for perfect and eventually perfect failure detectors in the context of predicate
detection.

Since we use infinitely often accurate failure detectors, we necessarily use the 
interrupt model of Garg and Mitchell El in this paper. Note that query based 
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failure detectors can be turned into interrupt style failure detectors by adding a 
concurrent task to the system which periodically queries the failure detector and 
interrupts the application whenever it detects state changes. For perfect failure 
detectors, this transformation can only increase the detection latency but main- 
tains the accuracy and completeness properties. Similarly, for eventually perfect 
failure detectors, only false detections may go unnoticed. Hence, the predicate 
detection results we obtain for perfect and eventually perfect failure detectors 
in the interrupt model also hold in the query model. 
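
The transformation from the query model to the interrupt model can be sketched as follows. This is only an illustration under stated assumptions: the class name `PollingAdapter`, the callbacks `query` and `notify`, and the scripted detector output in `demo` are hypothetical and do not appear in the paper. Each call to `poll_once` stands for one round of the periodic task.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class PollingAdapter:
    """Wraps a query-style failure detector and reports only state changes."""

    def __init__(self, query: Callable[[str], bool],
                 processes: Iterable[str],
                 notify: Callable[[str, bool], None]) -> None:
        self.query = query    # query(p) is True iff p is currently suspected
        self.notify = notify  # notify(p, suspected) plays the role of the interrupt
        # Assumption: initially no process is suspected.
        self.last: Dict[str, bool] = {p: False for p in processes}

    def poll_once(self) -> None:
        """One round of the periodic task: query every process, report changes."""
        for p, old in self.last.items():
            new = self.query(p)
            if new != old:
                self.last[p] = new
                self.notify(p, new)


def demo() -> List[Tuple[str, bool]]:
    # Hypothetical detector output for process "p1" at five successive polls.
    outputs = iter([False, True, True, False, True])
    events: List[Tuple[str, bool]] = []
    adapter = PollingAdapter(lambda p: next(outputs), ["p1"],
                             lambda p, suspected: events.append((p, suspected)))
    for _ in range(5):
        adapter.poll_once()
    return events
```

A suspicion that is raised and revoked between two polls is never reported by this adapter, which is why the argument above applies to perfect and eventually perfect failure detectors but not to infinitely often accurate ones.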

Failure detectors are grouped into classes that represent the set of failure
detectors satisfying the given properties. We denote the class of all perfect failure
detectors by $\mathcal{P}$, the class of all eventually perfect failure detectors by $\Diamond\mathcal{P}$, and
the class of all infinitely often accurate failure detectors by $\Box\Diamond\mathcal{P}$. Intuitively, a
failure detector $fd_1$ is stronger than a failure detector $fd_2$ (denoted $fd_1 \succeq fd_2$)
if there exists a distributed algorithm that can be used to emulate $fd_2$ using
$fd_1$. If $fd_1 \succeq fd_2$ and $fd_2 \succeq fd_1$ we write $fd_1 \equiv fd_2$. If $fd_1 \succeq fd_2$ but not
$fd_2 \succeq fd_1$ we say that $fd_1$ is strictly stronger than $fd_2$ and write $fd_1 \succ fd_2$. The
relation $\succeq$ can be defined for failure detector classes in an analogous way. From
the literature we know that the following relations hold: $\mathcal{P} \succ \Diamond\mathcal{P} \succ \Box\Diamond\mathcal{P}$.

3 Predicate Detection 

Given some global predicate $\varphi$ over the global state $G$ of $p_1, \ldots, p_n$, we would
like to have an algorithm which answers the question of whether or not $\varphi$ holds
in a given computation. Whenever an event occurs at some application process
$p_k$, a control message about this event is sent from $p_k$ to all the monitors (the
normal computation messages are called application messages). One generally 
assumes that the execution of the event and the sending of the control message 
execute as one atomic action. 

We define a property of the system as a set of executions. A system satisfies 
a property iff every execution which is possible by the system is an element of 
the property. Every such property can be written as the intersection of a safety 
property and a liveness property.

3.1 Three Different Detection Semantics 

We use the symbols $\Box$ (“always”) and $\Diamond$ (“eventually”) here in the following
way: Let $S$ be a safety property. We denote by $\Diamond S$ the property in which $S$
eventually holds, i.e., the set in which every execution in $S$ can be prefixed by
an arbitrary but finite sequence of states. We denote by $\Box\Diamond S$ the property where
$S$ holds infinitely often, i.e., the set consisting of all traces which can be
constructed from (infinitely) interleaving finite sequences from $S$ and $\Diamond S$. Note that
$S \subseteq \Diamond S \subseteq \Box\Diamond S$.

A detection algorithm for a global predicate $\varphi$ should notify us by triggering
a detection event on the monitor processes once $\varphi$ holds in the computation.
Formally, we seek an algorithm which satisfies the following two properties:

S (safety) if a monitor triggers a detection event then $\varphi$ has held in the
computation, and
L (liveness) once $\varphi$ holds in the computation, a monitor will eventually trigger
a detection event.

We assume that the algorithm triggers a positive signal on detection but we also
require the algorithm to revoke its detection by issuing a “previous detection
was wrong” signal to the application. If this occurs, we say that the algorithm
undetects the predicate. Note that we detect whether the predicate $\varphi$ held, which
is a stable predicate even if $\varphi$ is unstable. A predicate is stable if, once it holds,
it holds forever.

Definition 1 (Detection Semantics). Let $S$ and $L$ denote the safety and
liveness properties of predicate detection. We define the three detection semantics
$Sem_1$, $Sem_2$, and $Sem_3$ as follows (where $+$ denotes set intersection):

1. $Sem_1 = L + S$ (perfect)

2. $Sem_2 = L + \Diamond S$ (stabilizing)

3. $Sem_3 = L + \Box\Diamond S$ (infinitely often accurate)

To illustrate the use of these detection semantics, assume that $\varphi$ is a
debugging condition, i.e., a “bad state” which should not occur. On detection of
such a state, the software developer usually wants to stop the application
system and analyze the execution which caused the bad state. In this context, we
would ideally like a detection algorithm Alg for predicate $\varphi$ to satisfy $Sem_1$,
i.e., Alg will make no mistakes and not miss any occurrence of $\varphi$. We call this
perfect predicate detection. However, this is sometimes impossible to achieve. In
particular, if $\varphi$ contains conditions about the operational state of processes, the
detection algorithm might mistakenly detect $\varphi$, i.e., violate $S$. In these cases Alg
should at least satisfy $Sem_2$, i.e., Alg may (erroneously) detect the predicate and
later undetect it again. We call this stabilizing predicate detection because it is
guaranteed that the algorithm will eventually stop making wrong detections.

Note that from the user’s point of view there is no immediate way to
distinguish between a correct and an incorrect detection (this may only be achieved
through other means, e.g., through halting the system and inspecting it).
Stabilizing predicate detection may, however, still be useful, e.g., in situations where
false detections merely affect the efficiency of an application (not its
correctness) or in situations where achieving $Sem_1$ is provably impossible. Revisiting
our debugging example, the developer may want to detect a predicate $\varphi$ in a
distributed application in order to identify an invalid state of the application.
Detection semantics $Sem_2$ are sufficient in most cases, as the developer can
manually verify whether the predicate detection algorithm has been accurate. If it
has not been, the predicate detection is continued.

But even $Sem_2$ is sometimes impossible to satisfy, i.e., Alg may make
infinitely many mistakes about $\varphi$ holding. In this case we would prefer that Alg
behaves according to $Sem_3$, i.e., Alg continuously switches between phases where
possible detections are accurate and phases where mistakes regarding $\varphi$ are
made. We call this infinitely often accurate predicate detection. This means that
if $\varphi$ never holds then every detection event will be followed by an undetection
event. Semantics $Sem_3$ offer close to no guarantees and, hence, can be considered
as a “best effort” specification. But at least it is better than ignoring the safety
part of the specification altogether (i.e., we rule out trivial predicate detection which
always issues a detection event even in cases where $\varphi$ never holds). Note that
$Sem_1 \subseteq Sem_2 \subseteq Sem_3$.

3.2 Global Predicates in Faulty Systems 

In systems with crash faults, it is a natural desire to detect a class of predicates 
which makes no sense in fault-free systems because we now additionally want 
to detect predicates involving the operational state of processes. For example, 
we may want to detect the fact that “process $p_i$ crashed while holding a lock.”
To express this information within a global predicate we assume that the
operational state of a process is explicitly modeled by a boolean flag crashed which
is part of the global state. More specifically, $crashed_i$ is true iff $p_i$ has crashed.
Using this flag, we formalize the above predicate as $lock_i = true \wedge crashed_i$.

With respect to the truth value of a predicate on the global state, the crashed
variables are treated just like other variables local to the processes. Let $\alpha$
denote a local predicate which only references local variables of a process, e.g.,
$\alpha = (x_i = 1)$, and let $\beta$ denote a predicate which only contains references to
the operational state of a process, e.g., $\beta = crashed_i$. To detect $\alpha$ we can use
a standard mechanism for predicate detection in fault-free systems. Conversely,
to detect $\beta$ we can use a (reliable) failure detection algorithm. However, global
predicates can be constructed from boolean combinations of $\alpha$ and $\beta$. If we have
disjunctions of $\alpha$ and $\beta$, e.g., $x_i = 1 \vee crashed_i$, it is sufficient to run existing
detection algorithms for $\alpha$ and $\beta$ independently and issue a detection event as
soon as one of the algorithms issues such an event. However, global predicates
that are a conjunction of $\alpha$ and $\beta$ are more difficult to detect and are the focus
of this paper.

The following examples illustrate the types of predicates we address. Let $ec_i$
denote a local variable of $p_i$ which stores the sequence number of events (“event
count”) on $p_i$.

— $ec_i > 5 \wedge crashed_i$, i.e., “$p_i$ crashed after event 5”

— $ec_i = 5 \wedge crashed_i$, i.e., “$p_i$ crashed immediately after event 5”

— $ec_i < 5 \wedge crashed_i$, i.e., “$p_i$ crashed before reaching event 5”

— $ec_i = 5 \wedge \neg crashed_i$, i.e., “$p_i$ executed event 5”.

Although we have not restricted the class of predicates, it should be noted 
that only those predicates are detectable that do not explicitly depend on global 
time. For instance, the predicate $\varphi$ = “$p_i$ executed event $e_i$ more than 10 seconds
ago” is impossible to detect in an asynchronous system model where no notion
of global time exists. In analogy to Charron-Bost et al., we call predicates
that do not refer to real time time-free.

4 Predicate Detection in Faulty Systems

We now consider an asynchronous system where crash faults can happen and
study what types of detection semantics (i.e., $Sem_1$, $Sem_2$, or $Sem_3$) are
achievable using different classes of failure detectors. For simplicity, we will restrict our

On monitor b:
variables:
    G : (s_1, ..., s_n) init (I_1, ..., I_n)
    crashed[1..n] : array of {true, false} init (false, ..., false)
    φ : G × crashed → {true, false} init (by application)
    h : {true, false} init false

 1 algorithm:
 2   upon (a message (i, e) arrives) do
 3     (update G[i] according to e)
 4     if φ(G, crashed) ∧ ¬h then
 5       h := true
 6       (trigger detection event)
 7     elsif ¬φ(G, crashed) ∧ h then
 8       h := false
 9       (trigger undetection event)
10     end
11   upon (p_i is suspected or rehabilitated by failure detector) do
12     (update crashed[i] accordingly)
13     if φ(G, crashed) ∧ ¬h then
14       h := true
15       (trigger detection event)
16     elsif ¬φ(G, crashed) ∧ h then
17       h := false
18       (trigger undetection event)
19     end

Fig. 1. Generic algorithm for predicate detection in faulty environments.

attention to the case where $n = m = 1$, i.e., a system with two processes only,
namely an application process and a monitor process. We discuss the case where
$n, m > 1$ in Section 5.

If faults can happen, the predicate detection algorithm must cater for the
fact that a failure detector issues a suspicion or rehabilitation of a process. A
process is rehabilitated if it has been erroneously suspected and the failure
detector revokes its suspicion. Figure 1 shows a generic detection algorithm for this
case. A boolean flag $h$ (“history”) is used to record the type of the most recent
event which was triggered at the interface.
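
To make the role of the flag $h$ concrete, the generic algorithm can be sketched in Python. This is a minimal illustration, not the authors' implementation: the class name `Monitor`, the `signals` list, and the choice to represent the state of each process by its event count are assumptions made here.

```python
from typing import Callable, Dict, List

GlobalState = Dict[int, int]    # assumption: state of p_i is its event count ec_i
CrashedFlags = Dict[int, bool]


class Monitor:
    """Re-evaluates the predicate after every control message or detector event."""

    def __init__(self, phi: Callable[[GlobalState, CrashedFlags], bool],
                 n: int) -> None:
        self.phi = phi
        self.G: GlobalState = {i: 0 for i in range(1, n + 1)}
        self.crashed: CrashedFlags = {i: False for i in range(1, n + 1)}
        self.h = False                 # type of the most recent signal
        self.signals: List[str] = []   # stands in for events at the interface

    def _reevaluate(self) -> None:
        holds = self.phi(self.G, self.crashed)
        if holds and not self.h:
            self.h = True
            self.signals.append("detect")
        elif not holds and self.h:
            self.h = False
            self.signals.append("undetect")

    def on_control_message(self, i: int, event_count: int) -> None:
        self.G[i] = event_count
        self._reevaluate()

    def on_failure_detector(self, i: int, suspected: bool) -> None:
        # suspected=True is a suspicion, suspected=False a rehabilitation
        self.crashed[i] = suspected
        self._reevaluate()


def demo() -> List[str]:
    # Predicate: ec_1 = 1 and crashed_1
    monitor = Monitor(lambda G, crashed: G[1] == 1 and crashed[1], n=1)
    monitor.on_control_message(1, 1)       # event 1 on p_1
    monitor.on_failure_detector(1, True)   # suspicion: predicate holds
    monitor.on_failure_detector(1, False)  # rehabilitation: detection revoked
    monitor.on_control_message(1, 2)       # event 2 on p_1
    monitor.on_failure_detector(1, True)   # suspicion, but ec_1 is now 2
    return monitor.signals
```

In this run the first suspicion was a mistake of the failure detector, so the monitor issues a detection followed by an undetection and stays silent afterwards.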

4.1 Plausible Failure Detector 

Generally, the failure detector module accessed by the monitor issues suspicion
and rehabilitation events for a process $p_i$. We define the following two properties
concerning these events:

— (alternation) Suspicion and rehabilitation events for $p_i$ alternate, i.e., the
failure detector module never issues two suspicion events without issuing a
rehabilitation event in between, and vice versa.

— (plausibility) Reception of a control message from $p_i$ is only possible in phases
where $p_i$ is not suspected.

We argue later that these properties are fundamental to the predicate detection
problem. Based on these properties, we define a plausible failure detector as
follows:

Definition 2 (Plausible Failure Detector). A plausible failure detector is a 
failure detector which satisfies alternation and plausibility. 

As shown in the following lemma, for most failure detector classes a
plausible failure detector still belongs to the same class of failure detectors as its
non-plausible equivalent.

Lemma 1. Let $\mathcal{F}$ be either failure detector class $\Box\Diamond\mathcal{P}$ or $\Diamond\mathcal{P}$ and let $\overline{\mathcal{F}}$ denote
the set of failure detectors from $\mathcal{F}$ which are plausible. Then $\overline{\mathcal{F}} \equiv \mathcal{F}$.

Proof. A failure detector is rendered plausible by wrapping the failure detector 
and the delivery module for control messages into a separate module. If a con- 
trol message arrives after a suspicion event has been generated, a rehabilitation 
event is passed to the predicate detection algorithm before delivering the control 
message. Clearly, the wrapper can be implemented in asynchronous systems. 

Transforming a failure detector $fd$ in $\mathcal{P}$ into a plausible failure detector using
the algorithm in Figure 2 may result in a weaker failure detector. Indeed,
assume a system with one process $p$ and one monitor $b$, where $p$ has sent a control
message msg and then crashes. Before $b$ receives the control message, it detects
the crash of $p$. On reception of msg, the plausible version of $fd$ rehabilitates $p$ in
order to receive msg. Later, $fd$ eventually suspects $p$ again. However, a failure
detector in $\mathcal{P}$ is not allowed to make mistakes, i.e., it can never rehabilitate
processes. Hence, if $fd$ is made plausible with the algorithm in Figure 2, it is no longer in
class $\mathcal{P}$. This is the reason why Lemma 1 (using this particular wrapper) does
not hold for failure detectors in $\mathcal{P}$.

4.2 Using an Infinitely Often Accurate Failure Detector ($\Box\Diamond\mathcal{P}$)

Consider a purely asynchronous system in which crash faults can happen. Garg
and Mitchell [11] have shown that a failure detector in $\Box\Diamond\mathcal{P}$, i.e., an infinitely
often accurate failure detector, can be implemented in these systems.
Additionally, we assume that such a failure detector is plausible. Then, predicate detection
is only achievable with semantics $Sem_3$.

Theorem 1. In asynchronous systems with crash failures and any failure
detector in $\Box\Diamond\mathcal{P}$ it is (a) possible to satisfy detection semantics $Sem_3$ but it is (b)
impossible to satisfy detection semantics $Sem_2$ and $Sem_1$ for general predicates
without a failure detector strictly stronger than $\Box\Diamond\mathcal{P}$.

Proof. We prove (a) using the standard algorithm in Figure 1. The reliable
channel assumption ensures satisfaction of the liveness requirement of $Sem_3$ and
the plausible infinitely often accuracy of failure detection ensures the safety
requirement of $Sem_3$. Part (b) is proven indirectly: We assume that an algorithm
satisfying $Sem_2$ exists and use it to construct a failure detector that allows
consensus to be solved, a contradiction to the impossibility result by Fischer, Lynch
and Paterson.

On monitor b:
variables:
    suspects : set of (processes) init ∅
    wasSuspected : {true, false} init false

 1 algorithm:
 2   upon (fd rehabilitates p_i or a control message from p_i arrives) do
 3     wasSuspected := false
 4     if p_i ∈ suspects then
 5       wasSuspected := true
 6       suspects := suspects \ {p_i}
 7       (trigger event “rehabilitation of p_i”)
 8     endif
 9     if (control message was received in line 2) then
10       (deliver control message)
11       if wasSuspected then
12         suspects := suspects ∪ {p_i}
13         (trigger event “suspicion of p_i”)
14       endif
15     endif
16   upon (fd suspects p_i) do
17     if p_i ∉ suspects then
18       suspects := suspects ∪ {p_i}
19       (trigger event “suspicion of p_i”)
20     endif

Fig. 2. Implementation of a wrapper that makes any failure detector $fd$ in $\Box\Diamond\mathcal{P}$
or $\Diamond\mathcal{P}$ plausible in asynchronous systems. Note that lines 2 and 16 refer to events
generated by $fd$ while in lines 7, 10, 13, and 19 events at the interface of the
plausible failure detector are triggered which are then processed at lines 2 and
11 in the algorithm of Figure 1.
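
The wrapper translates almost line by line into executable form. The following Python sketch is an illustration only: the class name `PlausibleWrapper`, the `emit` callback, and the event tuples are assumptions made here, and one wrapper instance handles all processes by keeping a single `suspects` set.

```python
from typing import Callable, List, Optional, Set, Tuple

Event = Tuple[str, str, Optional[str]]   # (kind, process, payload)


class PlausibleWrapper:
    """Sits between the raw failure detector and the detection algorithm."""

    def __init__(self, emit: Callable[[Event], None]) -> None:
        self.emit = emit                 # interface towards the detection algorithm
        self.suspects: Set[str] = set()

    def _unsuspect(self, p: str) -> bool:
        """Rehabilitate p if it is suspected; return whether it was."""
        if p in self.suspects:
            self.suspects.remove(p)
            self.emit(("rehabilitation", p, None))
            return True
        return False

    def on_fd_rehabilitates(self, p: str) -> None:
        self._unsuspect(p)

    def on_control_message(self, p: str, msg: str) -> None:
        was_suspected = self._unsuspect(p)   # plausibility: rehabilitate first
        self.emit(("deliver", p, msg))
        if was_suspected:                    # then restore the suspicion
            self.suspects.add(p)
            self.emit(("suspicion", p, None))

    def on_fd_suspects(self, p: str) -> None:
        if p not in self.suspects:           # alternation: no repeated suspicion
            self.suspects.add(p)
            self.emit(("suspicion", p, None))


def demo() -> List[Event]:
    events: List[Event] = []
    wrapper = PlausibleWrapper(events.append)
    wrapper.on_fd_suspects("p1")
    wrapper.on_fd_suspects("p1")             # duplicate suspicion is swallowed
    wrapper.on_control_message("p1", "e1")   # late control message
    return events
```

The late control message is bracketed by a rehabilitation and a renewed suspicion, so the detection algorithm never receives a control message from a process it currently considers crashed.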

4.3 Using an Eventually Perfect Failure Detector ($\Diamond\mathcal{P}$)

Defining stronger limitations on incorrect failure suspicions results in stronger
predicate detection semantics. As $\mathcal{P}$ and $\Diamond\mathcal{P}$ are stronger than $\Box\Diamond\mathcal{P}$,
Theorem 1 (a) holds also for these failure detectors, i.e., $Sem_3$ can be achieved.
Note that the failure detector must be made plausible¹ before being used in the
algorithm of Figure 1.

Corollary 1. In asynchronous systems with crash failures and any failure
detector in $\Diamond\mathcal{P}$ it is possible to satisfy detection semantics $Sem_3$.

However, although an eventually perfect failure detector is stronger, it still is
not sufficient to detect predicates perfectly. Actually, even a perfect failure
detector cannot achieve perfect predicate detection. The intuitive reason for this is
depicted in Figure 3. Consider the case where a predicate $\varphi$ is true iff $p_1$ crashes
after event $e_1$, i.e., $\varphi = ec_1 = 1 \wedge crashed_1$. After suspecting $p_1$ (see Figure 3
(b)) the monitor $b_1$ must eventually raise an exception to the application that
the predicate held. However, $b_1$ can never be sure that the predicate detection
is accurate, because a message from $p_1$ may arrive later informing it about an
event $e_2$ (see Figure 3 (a)). Since the message can be delayed for an arbitrary
amount of time, $b_1$ cannot distinguish between both scenarios.

¹ Although making $\mathcal{P}$ plausible with the wrapper in Figure 2 weakens the failure
detector to $\Diamond\mathcal{P}$, Theorem 1 (a) still holds.



[Figure 3: panels (a) and (b), each showing $p_1$ and $b_1$ and the point at which the
predicate evaluates to true at $b_1$; a crash of $p_1$ is marked.]

Fig. 3. Predicate $\varphi = ec_1 = 1 \wedge crashed_1$ is not detectable according to detection
semantics $Sem_1$ with any failure detector in $\mathcal{P}$.



Theorem 2. In asynchronous systems with crash failures and any failure
detector not strictly stronger than $\mathcal{P}$ it is impossible to satisfy detection semantics
$Sem_1$.

Corollary 2. In asynchronous systems with crash failures and any failure
detector not strictly stronger than $\Diamond\mathcal{P}$ it is impossible to satisfy detection semantics
$Sem_1$.

On the other hand, achieving $Sem_2$ with $\Diamond\mathcal{P}$ is possible. However, the
proof again relies on the fact that the given failure detector is plausible. Using a
non-plausible failure detector may cause a miss of the occurrence of certain
predicates and thus violate the liveness property. Assume, for instance, the predicate
$\varphi = x_i = 1 \wedge \neg crashed_i$, with $x_i$ initially 0. Furthermore, assume that the failure
detector $fd$ in $\Diamond\mathcal{P}$ is not plausible and that it erroneously suspects process $p_i$.
Although $p_i$ sends the control messages about an event that sets $x_i$ to 1 and back
to 0 again, the monitor does not detect that $\varphi$ has held.

Theorem 3. In asynchronous systems with crash failures and any failure
detector in $\Diamond\mathcal{P}$ it is possible to satisfy detection semantics $Sem_2$.

4.4 Using a Perfect Failure Detector ($\mathcal{P}$)

Even a perfect failure detector is not sufficient to perfectly detect all possible
predicates. Indeed, from the point of view of predicate detection for general
predicates the strongest possible detection semantics are the same as for $\Diamond\mathcal{P}$. This
has already been shown in Theorem 2, i.e., it is impossible to detect with $Sem_1$
using any failure detector in $\mathcal{P}$. However, since $Sem_2$ is achievable using $\Diamond\mathcal{P}$, it
is also achievable with $\mathcal{P}$.

Corollary 3. In asynchronous systems with crash failures and a perfect failure
detector it is possible to satisfy detection semantics $Sem_2$.

Interestingly, we can detect predicates of the form $\alpha \wedge \beta$ according to $Sem_1$
using $\mathcal{P}$ if $\alpha$ is stable. The stability of $\alpha$ ensures that the predicate still holds,
although control messages may still arrive from events that occurred immediately
before the crash of the process (see Figure 3).



4.5 Introducing Failure Detector Class $\hat{\mathcal{P}}$

A perfect failure detector is not sufficient to achieve optimal detection
semantics in asynchronous systems. Intuitively, this is because $\mathcal{P}$ offers no
information about the relative ordering of the crashes with respect to other application
events. Consequently, we require a plausible failure detector that is still in $\mathcal{P}$.
However, we show that this plausible failure detector is actually stronger than
any failure detector in $\mathcal{P}$.

Definition 3 (Ordered Perfect Failure Detector). An ordered perfect
failure detector is a perfect failure detector which satisfies the following additional
order property: Together with every “suspicion of $p_i$” event, the failure detector
issues the event number of the last event that happened on $p_i$.

We denote the class of all ordered perfect failure detectors by $\hat{\mathcal{P}}$.

Theorem 4. $\hat{\mathcal{P}} \succ \mathcal{P}$.

Proof. The fact that $\hat{\mathcal{P}}$ is at least as strong as $\mathcal{P}$ is obvious. The proof that $\mathcal{P}$
is not at least as strong as $\hat{\mathcal{P}}$ reuses the idea of Theorem 2, since $\hat{\mathcal{P}}$ makes it possible to
distinguish the two situations which were indistinguishable if only $\mathcal{P}$ is available.

An ordered perfect failure detector makes it possible to order crashes and normal
process events causally, i.e., if a suspicion is issued by the failure detector and the
associated sequence number is $x$, then delivery of the suspicion event can be held
back until all control messages which have sequence numbers below or equal to $x$
have been delivered. Hence, plausibility for an ordered perfect failure detector is
achieved, which in turn means that the detection algorithm from Figure 1 can be used
to detect predicates with detection semantics $Sem_1$.
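
The hold-back rule can be sketched as a small buffering layer. The sketch below is illustrative and rests on assumptions that are not stated in the paper: control messages from one process arrive in sequence-number order starting at 1, and the names `OrderedDelivery`, `emit`, and the event tuples are hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Event = Tuple[str, str, Optional[int]]   # (kind, process, sequence number)


class OrderedDelivery:
    """Holds back a suspicion until all earlier control messages are delivered."""

    def __init__(self, emit: Callable[[Event], None]) -> None:
        self.emit = emit
        self.delivered: Dict[str, int] = {}   # highest sequence number delivered
        self.pending: Dict[str, int] = {}     # last event number of a held suspicion

    def on_control_message(self, p: str, seq: int) -> None:
        self.emit(("deliver", p, seq))
        self.delivered[p] = max(self.delivered.get(p, 0), seq)
        self._release(p)

    def on_suspicion(self, p: str, last_event: int) -> None:
        # The ordered perfect failure detector reports the last event number on p.
        self.pending[p] = last_event
        self._release(p)

    def _release(self, p: str) -> None:
        if p in self.pending and self.delivered.get(p, 0) >= self.pending[p]:
            del self.pending[p]
            self.emit(("suspicion", p, None))


def demo() -> List[Event]:
    events: List[Event] = []
    layer = OrderedDelivery(events.append)
    layer.on_control_message("p1", 1)
    layer.on_suspicion("p1", last_event=2)   # held back: message 2 is missing
    layer.on_control_message("p1", 2)        # releases the suspicion
    return events
```

Because the suspicion is only passed on after the message with the reported event number, no control message from the crashed process can arrive afterwards, and no rehabilitation is ever needed.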

Theorem 5. In asynchronous systems with crash failures and an ordered perfect
failure detector it is possible to detect general predicates with detection semantics
$Sem_1$.

Overall, perfect detection of general predicates in asynchronous systems is
achievable only if we postulate a failure detector that is strictly stronger than a
perfect failure detector. This is somewhat disappointing since even perfect failure
detectors are very hard to implement in practice. (This also shows that in these
situations stabilizing service semantics (like $Sem_2$) are a reasonable path to
follow.) However, ordered perfect failure detectors can still be implemented using a
timely computing base. In such an approach, an asynchronous network is
enhanced by a synchronous real-time control network. The asynchronous network
is assumed to be high-bandwidth and is used for regular “payload” traffic while
the synchronous network is only used for small control messages and therefore
can be low-bandwidth. Under these assumptions it is possible to build a failure
detection service that satisfies the order requirement of the ordered perfect class $\hat{\mathcal{P}}$
by synchronously passing information about sent messages over the control network.
Unfortunately, the execution of the event and the sending on the synchronous and
asynchronous networks together have to be executed as an atomic action, which is a rather
strong assumption. However, with this approach, a remote process accurately
detects process crashes and is aware of the number of control messages sent
prior to the crash.

5 Generalization to n Processes and m Monitors 

In the previous sections we considered predicates local to one process in
conjunction with a predicate on the operational state of this process, i.e., $\alpha \wedge \beta$. This
section generalizes our results to scenarios with multiple processes (i.e., $n > 1$)
and to multiple monitors (i.e., $m > 1$). The algorithms presented in Figures 1
and 2 are thus executed on every monitor. In the context of $n$ processes and $m$
monitors, the predicates are of the form $(\alpha_1 \wedge \beta_1)\ op\ (\alpha_2 \wedge \beta_2)\ op \ldots$, where $op$
denotes either $\wedge$ or $\vee$.

In a system with $n$ processes and $m$ monitors, a causal broadcast
mechanism is used so that the control messages are received by the monitors in causal
order.

5.1 Observer Independence 

Generalizing predicate detection to systems with multiple processes and multiple
monitors gives rise to the issue of observer independence. Depending on the
predicate $\varphi$ and the setting in which it is evaluated, the validity of certain predicates
depends on the observer. Observer independence is achieved if all possible
observations of the system result in the same truth value for $\varphi$ [5]. Assume, for
instance, that process $p_1$ executes an assignment $x := x + 1$ (i.e., event $e_1$ in
Figure 4) and $p_2$ an assignment $y := y + 1$ (i.e., event $e_2$) on variables $x$ and $y$ which
are initially 1. While monitor $b_1$ detects the predicate $\varphi = x = 1 \wedge y = 2$, $b_2$ does
not; the predicate $\varphi$ is thus not observer independent, although the
corresponding local predicates (i.e., $x = 1$ and $y = 2$) are. Charron-Bost et al. have
shown that observer independence is maintained for the disjunction of observer
independent predicates, whereas it generally is not for the conjunction.

Fig. 4. Example of an observer dependent predicate, where $e_1$ specifies the event
$x := x + 1$ and $e_2$ the event $y := y + 1$.
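
The two observations in this example can be enumerated mechanically. The following Python sketch is an illustration written for this purpose (the function names are hypothetical); it assumes that the two events are concurrent, so that both delivery orders are valid observations.

```python
from itertools import permutations
from typing import Dict, Iterator, List, Tuple

State = Dict[str, int]


def observations(events: Tuple[str, ...],
                 initial: State) -> Iterator[Tuple[Tuple[str, ...], List[State]]]:
    """For every order of the concurrent events, list the global states observed.

    Each event is named after the variable it increments by one.
    """
    for order in permutations(events):
        state = dict(initial)
        seen = [dict(state)]
        for var in order:
            state[var] += 1
            seen.append(dict(state))
        yield order, seen


def holds_somewhere(seen: List[State]) -> bool:
    """True iff x = 1 and y = 2 holds in some observed global state."""
    return any(s["x"] == 1 and s["y"] == 2 for s in seen)


def verdicts() -> Dict[Tuple[str, ...], bool]:
    return {order: holds_somewhere(seen)
            for order, seen in observations(("x", "y"), {"x": 1, "y": 1})}
```

A monitor that sees the increment of $y$ first passes through the state $x = 1, y = 2$, whereas a monitor that sees the increment of $x$ first never does, which is exactly the disagreement described above.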

In general, two approaches are possible to address the problem of observer
independence: (a) limiting the set of observed predicates or (b) defining a
different notion of what it means for $\varphi$ to hold. We will focus on the former approach
here. The latter approach has been studied by Gärtner and Kloppenburg.

5.2 Limiting the Set of Observed Predicates 

The global predicates we are considering consist of the conjunction and
disjunction of local predicates $\alpha_i$ and predicates about the operational state of processes
$\beta_i$. Unreliable failure detection introduces a new source of observer dependence;
observer independence for a global predicate generally depends on the failure
detector. Obviously, $\beta_i$ is detectable in an observer independent way if a perfect
failure detector is available. However, predicates of type $\alpha_i \wedge \beta_i$ need an ordered
perfect failure detector to be detectable in an observer independent way. A
failure detector of class $\mathcal{P}$ is sufficient if $\alpha_i$ is stable. On the other hand, a failure
detector in $\Diamond\mathcal{P}$ only achieves “eventual” observer independence, whereas with
$\Box\Diamond\mathcal{P}$, observer independence may never be achieved.

Limiting the set of observed predicates to observer independent predicates
considerably reduces the number of global predicates that can be detected.
However, following Charron-Bost et al. and the above findings, we can construct
new observer independent global predicates from smaller building blocks. For
example, disjunctions of stable local predicates in conjunction with predicates on
the operational state of processes, i.e., $(\alpha_i \wedge \beta_i) \vee (\alpha_j \wedge \beta_j)$, remain observer
independent if a failure detector in $\mathcal{P}$ is available. On the other hand, conjunctions of
observer independent local predicates and predicates about the operational state
of processes, i.e., $(\alpha_i \wedge \beta_i) \wedge (\alpha_j \wedge \beta_j)$, generally are not observer independent.

6 Conclusions 

This paper investigates the predicate detection semantics that are achievable for
general predicates using either failure detector classes $\Box\Diamond\mathcal{P}$, $\Diamond\mathcal{P}$, or $\mathcal{P}$. A general
predicate thereby is of the form $\alpha \wedge \beta$, where $\alpha$ is a local predicate and $\beta$ denotes
a predicate on the operational state of a process, i.e., specifies whether a process
has crashed or not. We define three different predicate detection semantics:
perfect (i.e., $Sem_1$), stabilizing ($Sem_2$), and infinitely often accurate ($Sem_3$). Our
results show that failure detector class $\Box\Diamond\mathcal{P}$ makes it possible to detect general predicates
according to $Sem_3$, whereas $\Diamond\mathcal{P}$ enables $Sem_2$. Somewhat surprisingly, a perfect
failure detector is not sufficient to detect general predicates according to $Sem_1$,
indicating the importance of stabilizing detection semantics. This leads to the
definition of a stronger failure detector, called ordered perfect and denoted $\hat{\mathcal{P}}$.
With $\hat{\mathcal{P}}$, perfect predicate detection (i.e., $Sem_1$) is achievable. An overview of
our results is shown in Table 1.

Table 1. Strongest achievable predicate detection semantics with respect to
types of predicates and the failure detector class available. Again, $\alpha$ denotes
a predicate which refers only to normal process variables and $\beta$ is a predicate
referring only to the operational state of the process.

Failures  Predicates                               Failure Detector Class      Achievable Semantics  Reference
--------  ---------------------------------------  --------------------------  --------------------  ---------
none      $\alpha$                                 none                        $Sem_1$
crash     $\alpha$                                 none                        $Sem_1$
crash     $\beta$                                  $\mathcal{P}$               $Sem_1$
crash     $\beta$                                  $\Diamond\mathcal{P}$       $Sem_2$
crash     $\beta$                                  $\Box\Diamond\mathcal{P}$   $Sem_3$
crash     $\alpha \vee \beta$                      $\mathcal{P}$               $Sem_1$               Sect. 3.2
crash     $\alpha \wedge \beta$                    $\Box\Diamond\mathcal{P}$   $Sem_3$               Thm. 1
crash     $\alpha \wedge \beta$                    $\Diamond\mathcal{P}$       $Sem_2$               Thm. 3
crash     $\alpha \wedge \beta$                    $\mathcal{P}$               $Sem_2$               Cor. 3
crash     $\alpha \wedge \beta$, $\alpha$ stable   $\mathcal{P}$               $Sem_1$               Sect. 4.4
crash     $\alpha \wedge \beta$                    $\hat{\mathcal{P}}$         $Sem_1$               Thm. 5

In the future, we plan to further investigate issues of observer independence
in systems with n processes and m monitors and consider predicate detection
under more severe fault assumptions, e.g., crash-recovery.
Abstract. We investigate a new property of computing systems called
weak stabilization. Although this property is strictly weaker than the 
well-known property of stabilization, weak stabilization is superior to 
stabilization in several respects. In particular, adding delays to a system 
preserves the system property of weak stabilization, but does not neces- 
sarily preserve its stabilization property. Because most implementations 
are bound to add arbitrary delays to the systems being implemented, 
weakly stabilizing systems are much easier to implement than stabilizing 
systems. We also prove the following important result. A weakly sta- 
bilizing system that has a finite number of states is in fact stabilizing 
assuming that the system execution is strongly fair. Finally, we discuss 
an interesting method for composing several weakly stabilizing systems 
into a single weakly stabilizing system. 



1 Introduction 

There has been growing interest in recent years in designing and implementing
stabilizing computing systems. See, for instance, [7]. Unfortunately,
stabilizing systems are difficult to implement in such a way as to preserve their 
stabilization properties. (This fact is often ignored in light of the immense in- 
tellectual pleasure that one can derive from designing such systems.) The main 
reason for this difficulty is that stabilization properties are extremely delay sen- 
sitive. The simple transformation of adding a small delay unit to a stabilizing 
system can render this system non-stabilizing. Because every system
implementation is bound to add one or more delay units to the system being implemented,
the implemented system often ends up being non-stabilizing. 

This situation leaves the designers of stabilizing systems, who wish to imple- 
ment their designs, with three options. 

The first option is to identify all the delay units that may be added to the 
system during its implementation, then ensure that the system with the added 
delay units is still stabilizing. This option is not attractive because a system with 
many delay units has a relatively large state space, and the task of ensuring that 
such a system is stabilizing is usually hard. 

The second option is to implement the system in a way that does not necessarily 
preserve its stabilization properties, then hope for the best, namely that 
the implemented system will turn out to be “almost stabilizing”. In this case, 
the system designer does not plan on validating this hope, and so the concept 
of “almost stabilization” is left vague and undefined. 

* This work is supported in part by DARPA contract F33615-01-C-1901. 

The third option is to introduce a weaker version of the stabilization property, 
then show that this weak stabilization is delay insensitive. In this case, 
most sensible implementations of stabilizing (or even weakly stabilizing) systems 
are guaranteed to be weakly stabilizing. 

In this paper, we adopt this third option, and give a formal characterization of 
the weak stabilization property. In particular, we show that weak stabilization is 
delay insensitive and that it is a good approximation of the original stabilization 
property. 

2 Stabilization and Weak Stabilization 

A (computing) system is a nonempty set of variables, whose values are from pre- 
defined domains, and a nonempty set of actions that can be executed to update 
the values of the variables. Each action is of the form: 

    <guard> -> <assignment> 

The <guard> is a Boolean expression over the system variables, and <assignment> 
is an assignment statement of the form: 

    <variable> := <expression> 

The <expression> is an expression over the system variables and its value is from 
the domain of <variable>. 

For simplicity, we require that each variable of a system S be in the left-hand 
side of the assignment of at most one action of system S. In this case, the v action 
refers to the action where variable v is in the left-hand side of its assignment. 

A state of a system S is a function that assigns a value to each variable of S. 
The value assigned to each variable is from the domain of that variable. 

An action of a system S is enabled at a state p of S iff the guard of the action 
is true at state p. For simplicity, we assume that at least one action of a system 
S is enabled at each state of S. 

A transition of a system S is a pair (p, q), where p and q are states of S 
and there is an action of S that is enabled at state p and executing this action 
starting at state p yields system S in state q. 

A computation of a system S is an infinite sequence p.0, p.1, ... of S states 
such that every pair (p.i, p.(i+1)) of successive states in the sequence is a 
transition of S. 

A state predicate of a system S is a function that has a Boolean value, true 
or false, at each state of S. Let true denote the state predicate whose value is 
true at each state of S, and false denote the state predicate whose value is false 
at each state of S. 

A state p of a system S is a P state iff P is a state predicate of S whose value 
is true at state p. 

Let P and Q be state predicates of a system S. Predicate P equals predicate 
Q, denoted P = Q, in system S iff P and Q have equal values at every state of S. 
Predicate P implies predicate Q, denoted P => Q, in system S iff for every state 
p of system S, if P is true at p then Q is true at p. 

A state predicate P of a system S is closed in S iff for each transition (p, q) 
of S, if p is a P state, then q is a P state. 

A system S is stabilizing to a state predicate P iff the following two condi- 
tions hold. First, P is closed in S. Second, for every state p, every computation 
of S that starts at p has a P state. 

A system S is weakly stabilizing to a state predicate P iff the following two 
conditions hold. First, P is closed in S. Second, for every state p, there is a 
computation of S that starts at p and has a P state. 
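
For a system with finitely many states, the two definitions above can be checked mechanically. The following is a minimal sketch, not part of the original formalism: it assumes the system is given as an explicit list of states and transitions, and the function names and the three-state toy system are hypothetical. It relies on the earlier assumption that every state has at least one outgoing transition, so that convergence fails exactly when some cycle lies entirely outside P.

```python
from collections import deque

def is_closed(transitions, in_p):
    """P is closed: every transition out of a P state ends in a P state."""
    return all(in_p(q) for (p, q) in transitions if in_p(p))

def states_reaching_p(states, transitions, in_p):
    """States from which at least one computation has a P state."""
    preds = {s: [] for s in states}
    for p, q in transitions:
        preds[q].append(p)
    seen = {s for s in states if in_p(s)}
    queue = deque(seen)
    while queue:  # backward breadth-first search from the P states
        q = queue.popleft()
        for p in preds[q]:
            if p not in seen:
                seen.add(p)
                queue.append(p)
    return seen

def is_weakly_stabilizing(states, transitions, in_p):
    return (is_closed(transitions, in_p)
            and states_reaching_p(states, transitions, in_p) == set(states))

def is_stabilizing(states, transitions, in_p):
    """Every computation has a P state iff no cycle lies entirely outside P."""
    if not is_closed(transitions, in_p):
        return False
    outside = {s for s in states if not in_p(s)}
    changed = True
    while changed:  # prune states that cannot stay outside P for one more step
        changed = False
        for s in list(outside):
            if not any(p == s and q in outside for (p, q) in transitions):
                outside.remove(s)
                changed = True
    return not outside

# Hypothetical toy system: state 2 is the only P state.
STATES = [0, 1, 2]
TRANSITIONS = [(0, 0), (0, 1), (1, 2), (2, 2)]
in_p = lambda s: s == 2

print(is_weakly_stabilizing(STATES, TRANSITIONS, in_p))  # True
print(is_stabilizing(STATES, TRANSITIONS, in_p))         # False
```

The self-loop at state 0 is what separates the two properties here: some computation from state 0 reaches P, but the computation that stays at state 0 forever does not.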

Theorem 1. If a system S is stabilizing to a state predicate P, then S is weakly 
stabilizing to P. The converse does not necessarily hold. 

Proof. Stabilization clearly implies weak stabilization. It remains to show that 
the converse does not necessarily hold. Consider a unidirectional token ring sim- 
ilar to that discussed in ^j. Assume that there are four or more actions in the 
ring and that the domain of values for each variable is 0..1. It is straightforward 
to show that this ring is weakly stabilizing but not stabilizing to an appropriate 
state predicate P. 

3 Proof Obligations 

In this section, we state proof obligations that can be used to prove that a sys- 
tem S is stabilizing, or weakly stabilizing, to a state predicate P. But first, we 
need to introduce the concepts of well-founded domain and ranking function. 

A well-founded domain is a pair (D, >) where D is a set of elements and > is 
a total order relation over the elements of D such that each sequence of elements 
of D, that is decreasing with respect to the relation >, is finite. 

A ranking function F of a system S is a function that assigns to each state p 
of S, a value F.p from a well-founded domain. 

The following three theorems state proof obligations for stabilization and 
weak stabilization. Correctness of these theorems is straightforward. 

Theorem 2. (Closure of P) 

If for every P state p of a system S and every transition (p, q) of S, 
q is a P state 
then P is closed in S. 

Theorem 3. (Convergence to P) 

If there is a ranking function F of a system S such that 
for every state p of S and every transition (p, q) of S 
F.p > F.q or q is a P state 
then for every state p of S, every computation of S, 
that starts at p, has a P state. 
Theorem 4. (Weak Convergence to P) 

If there is a ranking function F of a system S such that 

for every state p of S, there is a transition (p, q) of S where 
F.p > F.q or q is a P state 

then for every state p of S, there is a computation of S 
that starts at p and has a P state. 

Note that a ranking function for proving stabilization (or convergence) of a 
system S needs to be decreased by every action of S, whereas a ranking function 
for proving weak stabilization (or weak convergence) of S needs to be decreased 
by at least one action of S. Thus, a ranking function for proving stabilization of 
a system is stricter than one for proving weak stabilization of the same system. 
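
The difference between the two obligations can be made concrete on a small example. The sketch below is illustrative only: the function names and the four-state system are hypothetical, and the ranking function simply maps each state to a natural number, so the usual order on the naturals serves as the well-founded domain.

```python
def convergence_obligation(transitions, rank, in_p):
    """Theorem 3: every transition decreases the rank or ends in a P state."""
    return all(rank(p) > rank(q) or in_p(q) for (p, q) in transitions)

def weak_convergence_obligation(states, transitions, rank, in_p):
    """Theorem 4: every state has some transition that decreases the rank
    or ends in a P state."""
    return all(
        any(p == s and (rank(s) > rank(q) or in_p(q)) for (p, q) in transitions)
        for s in states
    )

# Hypothetical system: state 0 is the only P state; (2, 3) moves away from P.
STATES = [0, 1, 2, 3]
TRANSITIONS = [(3, 2), (2, 1), (2, 3), (1, 0), (0, 0)]
rank = lambda s: s
in_p = lambda s: s == 0

print(convergence_obligation(TRANSITIONS, rank, in_p))               # False
print(weak_convergence_obligation(STATES, TRANSITIONS, rank, in_p))  # True
```

The transition (2, 3) increases the rank, so the obligation of Theorem 3 fails, while every state still has at least one rank-decreasing transition, which is all that Theorem 4 asks for.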

4 Delay Insensitivity 

In this section, we discuss how to transform a system by adding a delay to it, 
and show that in general such a transformation preserves weak stabilization of 
the system but not its stabilization. 

Let v be a variable of a system S. System S can be transformed, by adding 
a delay to variable v, as follows. 

i. Add to system S a new variable dv, whose domain of values is the same as 
that of v. 

ii. Modify every action “g -> s” of S by replacing every occurrence of v in the 
guard g and in the right-hand side of the assignment s by an occurrence of 
dv. 

iii. Add the action “dv ≠ v -> dv := v” to system S. 

The resulting system after this transformation is denoted S<v>. The added 
variable dv in the transformed system can be thought of as a delayed version of 
variable v. The added dv-action in the transformed system can be thought of as 
the added delay to variable v. 
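
The three steps of the transformation can be sketched in code. This is a minimal illustration under stated assumptions, not the paper's notation: an action is represented as a hypothetical triple (target variable, guard function, expression function) over a state dictionary, and the substitution of dv for v in step ii is done by evaluating the original guard and expression on a view of the state in which v reads the value of dv. The example system and the helper names `add_delay`, `enabled`, and `step` are assumptions.

```python
def add_delay(actions, v):
    """Return the actions of S<v>, given the actions of S and a variable name v."""
    dv = "d" + v  # step i: the new variable (its value must be present in the state)

    def substituted(f):
        # step ii: evaluate f as if every occurrence of v were replaced by dv
        def g(state):
            view = dict(state)
            view[v] = state[dv]
            return f(view)
        return g

    delayed = [(target, substituted(guard), substituted(expr))
               for (target, guard, expr) in actions]
    # step iii: the added action  dv != v -> dv := v  (reads the real v)
    delayed.append((dv, lambda s: s[dv] != s[v], lambda s: s[v]))
    return delayed

def enabled(actions, state):
    """Indices of the actions whose guards are true at the given state."""
    return [k for k, (_, guard, _) in enumerate(actions) if guard(state)]

def step(actions, state, k):
    """Execute action k at the given state and return the resulting state."""
    target, guard, expr = actions[k]
    if not guard(state):
        raise ValueError(f"action {k} is not enabled")
    return {**state, target: expr(state)}

# Hypothetical example system with variables u, v, out.
S = [
    ("u",   lambda s: s["u"] != s["v"], lambda s: s["v"]),  # u != v -> u := v
    ("v",   lambda s: s["v"] != s["u"], lambda s: s["u"]),  # v != u -> v := u
    ("out", lambda s: s["u"] == s["v"], lambda s: s["u"]),  # u = v -> out := u
]
Sv = add_delay(S, "v")
state = {"u": 0, "v": 0, "dv": 1, "out": 0}
print(enabled(Sv, state))   # [0, 1, 3]
print(step(Sv, state, 0))   # {'u': 1, 'v': 0, 'dv': 1, 'out': 0}
```

Note that only the reads of v are redirected to dv; the v action still writes v itself, which is why a stale dv can drive the system away from a state where the original predicate holds.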

Next, we discuss the effect of adding a delay to some variable of a system 
on the closed predicates of that system. Let P be a closed predicate of a system 
S. Is predicate P also closed in the transformed system S<v>? The answer to 
this question is “no” in general. This is because the added variable dv does not 
occur in P. Thus, the value of dv can be arbitrary at a P state of system S<v>. 
Starting from this state and executing an action where some variable that occurs 
in P is updated using variable dv can yield system S<v> in a state where predicate 
P no longer holds. This shows that P is not necessarily closed in S<v>. 

Though a predicate P that is closed in a system S is not necessarily closed 
in the transformed system S<v>, we show (in Theorem 5 below) that another 
predicate, related to P, is closed in S<v>. But first we need to introduce the 
concept of a state predicate being exclusive with respect to some variable in its 
system. 
Let v be a variable of a system S, and P be a state predicate of S. Predicate 
P is v-exclusive in S iff for every P state p, if the v action in S, if any, is enabled 
at p, then no other action of S is enabled at p. 

Theorem 5. 

If P is a closed state predicate of a system S, and 
P is v-exclusive in S, 

then the state predicate (P ∧ (dv = v ∨ G.v)) is closed in S<v>, 

where G.v is the negation of the disjunction of the guards of all 
actions, other than the dv action and the v action, in system S<v>. 

Proof. Let p be a (P ∧ (dv = v ∨ G.v)) state of system S<v>, and assume that 
an execution of an action c of S<v> starting at state p yields the system in a 
state q. We need to show that q is a (P ∧ (dv = v ∨ G.v)) state. There are six 
cases to consider. 

Case 1: (p is a (P ∧ dv = v) state and c is the dv action) 

This case is not valid because the dv action is not enabled at p. 

Case 2: (p is a (P ∧ dv = v) state and c is the v action) 

In this case, P is true at state q because P is true at state p and P is closed in 
system S. Also, G.v is true at state p because p is a P state and P is v-exclusive 
in S. Predicate G.v remains true at state q because v, the only variable updated 
by action c, does not occur in G.v. Thus, q is a (P ∧ G.v) state. 

Case 3: (p is a (P ∧ dv = v) state and c is any other action) 

In this case, P is true at state q because P is true at state p and P is closed 
in system S. Also, dv = v at state q because dv = v at state p and neither dv 
nor v is updated by action c. Thus, q is a (P ∧ dv = v) state. 

Case 4: (p is a (P ∧ G.v) state and c is the dv action) 

In this case, P is true at state q because P is true at state p and dv, the 
only variable updated by action c, does not occur in P. Also, dv = v at state q 
because c is the dv action. Thus, q is a (P ∧ dv = v) state. 

Case 5: (p is a (P ∧ G.v) state and c is the v action) 

In this case, P is true at state q because P is true at state p and P is closed 
in system S. Also, G.v is true at state p, and G.v remains true at state q because 
v, the only variable updated by action c, does not occur in G.v. Thus, q is a (P 
∧ G.v) state. 

Case 6: (p is a (P ∧ G.v) state and c is any other action) 

This case is not valid because no action, other than the dv action and the v 
action, can be enabled at p where G.v is true. 

Having established that the state predicate (P ∧ (dv = v ∨ G.v)) is closed 
in the transformed system S<v>, it seems reasonable to ask the following two 
questions. 

Given that system S is stabilizing to P, is the transformed system S<v> 
stabilizing to (P ∧ (dv = v ∨ G.v))? Given that S is weakly stabilizing to P, 
is S<v> weakly stabilizing to (P ∧ (dv = v ∨ G.v))? The next two theorems 
answer these two questions with “not necessarily” and “yes” respectively. 
Theorem 6. 

If P is a v-exclusive state predicate in a system S, and 
S is stabilizing to P, 

then S<v> is not necessarily stabilizing to (P ∧ (dv = v ∨ G.v)). 



Proof. We exhibit a system S that is stabilizing to a v-exclusive state predicate 
P, and show that the transformed system S<v> is not stabilizing to (P ∧ (dv 
= v ∨ G.v)). Consider a system S that has three binary variables, named “u”, 
“v”, and “out”, and the following three actions: 

1 : u ≠ v -> u := v 

2 : v ≠ u -> v := u 

3 : u = v -> out := u 

It is straightforward to show that the state predicate P, defined as (u = v), 
is v-exclusive in S and that S is stabilizing to P. 

Now consider the transformed system S<v>. This system has four binary 
variables, namely “u”, “v”, “out”, and “dv”, and the following four actions: 

1 : u ≠ dv -> u := dv 

2 : dv ≠ u -> v := u 

3 : u = dv -> out := u 

4 : dv ≠ v -> dv := v 

The predicate G.v, which is the negation of the disjunction of the guards 
of all actions other than the dv action and the v action in S<v>, is defined as 
follows. 

G.v = ¬(u ≠ dv ∨ u = dv) 
    = false 

Thus, the state predicate (P ∧ (dv = v ∨ G.v)) is defined as (u = v ∧ dv = v). 

To show that the transformed system S<v> is not stabilizing to (u = v ∧ dv 
= v), it is sufficient to exhibit an infinite computation of S<v> that does not 
have a state where (u = v ∧ dv = v) holds. Consider a state p of S<v> where 
(u = v ∧ u ≠ dv) holds. If system S<v> starts at state p and the four actions 
1, then 3, then 4, then 2 are executed repeatedly in this order, then S<v> will 
never reach a state where (u = v ∧ dv = v) holds. Thus, S<v> is not stabilizing 
to (u = v ∧ dv = v). 
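
The non-converging computation in this proof can be replayed directly. The sketch below is only an illustration: it encodes the four actions of S<v> as Python functions over a state dictionary (the representation and the names `ACTIONS`, `target`, and `run` are assumptions), starts at one state where u = v and u differs from dv, and executes the actions in the order 1, 3, 4, 2 for several rounds.

```python
# Each entry: action number -> (guard, updated variable, assigned expression).
ACTIONS = {
    1: (lambda s: s["u"] != s["dv"], "u",   lambda s: s["dv"]),
    2: (lambda s: s["dv"] != s["u"], "v",   lambda s: s["u"]),
    3: (lambda s: s["u"] == s["dv"], "out", lambda s: s["u"]),
    4: (lambda s: s["dv"] != s["v"], "dv",  lambda s: s["v"]),
}

def target(s):
    """The predicate (u = v and dv = v)."""
    return s["u"] == s["v"] == s["dv"]

def run(state, schedule, rounds):
    """Execute the schedule repeatedly; fail loudly if an action is not enabled."""
    trace = [dict(state)]
    for _ in range(rounds):
        for k in schedule:
            guard, var, expr = ACTIONS[k]
            assert guard(state), f"action {k} is not enabled"
            state = {**state, var: expr(state)}
            trace.append(state)
    return trace

trace = run({"u": 0, "v": 0, "dv": 1, "out": 0}, [1, 3, 4, 2], rounds=3)
print(any(target(s) for s in trace))  # False
print(trace[8] == trace[0])           # True
```

Every action in the schedule is enabled when its turn comes, and after two rounds the system is back in its starting state, so the computation can be extended forever without ever satisfying the predicate.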



Theorem 7. 

If P is a v-exclusive state predicate in a system S, and 
S is weakly stabilizing to P 

then S<v> is weakly stabilizing to (P ∧ (dv = v ∨ G.v)). 

Proof. By Theorem 5, the state predicate (P ∧ (dv = v ∨ G.v)) is closed in 
the transformed system S<v>. It remains to be shown that for every state p of 
S<v>, there is a computation of S<v> that starts at p and has a (P ∧ (dv = v 
∨ G.v)) state. Let p be any state of system S<v>, and q be the corresponding 
state of system S. Thus, the value of each variable of system S at state q equals 
the value of the corresponding variable of system S<v> at state p. Because S is 
weakly stabilizing to P, there is a computation x of S that starts at q and has a 
P state. 

Now consider computation y of S<v> that is generated to track the generation 
of computation x as follows. First, computation y starts at state p. Second, 
the sequence of actions of system S<v> that is executed to generate the states in 
y is the same as the sequence of actions of system S that is executed to generate 
the states in x, with the following exception. If the values of variables dv and v 
are different in the last state generated in y, then the next executed action in y 
is the dv action. Then, the generation of computation y continues to track the 
generation of x. Because computation x eventually reaches a P state, computation 
y eventually reaches a (P ∧ dv = v) state. This completes the proof that 
the transformed system S<v> is weakly stabilizing to (P ∧ (dv = v ∨ G.v)). 

5 Achieving Stabilization 

In the last section, we established that adding delays to a system S preserves the 
weak stabilization of S, but does not necessarily preserve the stabilization of S. 
In this section, we establish that weak stabilization is a “good approximation” 
of stabilization. In particular, we show that under reasonable conditions, namely 
that the system has a finite number of states and its execution is strongly fair, 
weak stabilization of the system implies stabilization of the same system. 

A computation of a system S is strongly fair iff for every transition (p, q) 
of system S, if state p occurs infinitely many times in the computation, then 
transition (p, q) occurs infinitely many times in the computation. 

A system S is stabilizing to a state predicate P under strong fairness iff P is 
closed in S, and for every state p of S, every strongly fair computation of S, that 
starts at p, has a P state. 

Theorem 8. 

If S is a system that has a finite number of states, and 
S is weakly stabilizing to a state predicate P, 
then S is stabilizing to P under strong fairness. 

Proof. Because S is weakly stabilizing to P, P is closed in S. It remains to be 
shown that for every state p of S, every strongly fair computation of S that starts 
at p has a P state. Let x be a strongly fair computation, of the form (p.0, p.1, 
...), that starts at p (i.e. p = p.0). We need to show that computation x has a 
P state. 

Because S has a finite number of states, at least one state q occurs infinitely 
many times in computation x. Because S is weakly stabilizing to P, there is a 
computation (q, q.1, q.2, ...) that starts at q and has a P state. Thus, some q.i in 
the computation (q, q.1, q.2, ...) is a P state. Because q occurs infinitely many 
times in the strongly fair computation x, both transition (q, q.1) and state q.1 
occur infinitely many times in x. Similarly, because q.1 occurs infinitely many 
times in x, both transition (q.1, q.2) and state q.2 occur infinitely many times 
in x. This argument can be extended to show that a P state, namely state q.i, 
occurs (infinitely many times) in computation x. This completes the proof that 
S is stabilizing to P under strong fairness. 
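
One way to see the theorem at work is to generate a strongly fair computation and watch it reach a P state. The sketch below is an illustration under assumptions that are not in the paper: it uses the four-action system S<v> from the proof of Theorem 6 (states are tuples (u, v, dv, out)), and it produces fairness with a per-state round-robin pointer, so that each time a state is revisited its next outgoing transition is taken. Under that policy, any state visited infinitely often has each of its transitions taken infinitely often. The names `successors` and `fair_run` are hypothetical, and a finite run is of course only a prefix of a computation.

```python
from itertools import product

def successors(s):
    """Outgoing transitions of S<v>, one per enabled action, in action order."""
    u, v, dv, out = s
    nxt = []
    if u != dv:
        nxt.append((dv, v, dv, out))  # action 1: u := dv
    if dv != u:
        nxt.append((u, u, dv, out))   # action 2: v := u
    if u == dv:
        nxt.append((u, v, dv, u))     # action 3: out := u
    if dv != v:
        nxt.append((u, v, v, out))    # action 4: dv := v
    return nxt

def in_p(s):
    """The predicate (u = v and dv = v)."""
    return s[0] == s[1] == s[2]

def fair_run(start, max_steps=1000):
    """Number of steps until a P state is reached, or None if the bound is hit."""
    turn = {}  # per-state round-robin pointer over outgoing transitions
    s = start
    for steps in range(max_steps):
        if in_p(s):
            return steps
        options = successors(s)
        k = turn.get(s, 0)
        turn[s] = (k + 1) % len(options)
        s = options[k]
    return None

results = {s: fair_run(s) for s in product((0, 1), repeat=4)}
print(all(r is not None for r in results.values()))  # True
```

Starting from the state used in the proof of Theorem 6, the round-robin policy cannot keep repeating the schedule 1, 3, 4, 2, because each revisit forces a different transition to be taken.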



6 Theorems of Weak Stabilization 

The next three theorems state interesting properties of weak stabilization. Be- 
cause similar properties hold for stabilization, these theorems serve as further 
evidence that weak stabilization is a “good approximation” of stabilization. 

Theorem 9. (Base Theorem) 

Each system is weakly stabilizing to the state predicate true. 



Theorem 10. (Theorem of Junction) 

If S is weakly stabilizing to a state predicate P, and 
S is weakly stabilizing to a state predicate Q, 
then S is weakly stabilizing to (P ∨ Q), and 

S is weakly stabilizing to (P ∧ Q), provided that (P ∧ Q) ≠ false. 



Theorem 11. (Theorem of Weakening) 

If S is weakly stabilizing to a state predicate P, 

Q is a closed state predicate in S, and 
P => Q in S, 

then S is weakly stabilizing to Q. 

Proofs of these three theorems are rather straightforward. We present here 
the most interesting proof, namely the proof of the second part of Theorem 10. 

Because both P and Q are closed in system S, (P ∧ Q) is closed in S. 
It remains to show that for each state p of S, there is a computation that starts 
at p and has a (P ∧ Q) state. Because S is weakly stabilizing to P, there is a 
computation (p.0, p.1, ...) that starts at p (i.e. p = p.0) and has a P state p’ 
(i.e. p’ = p.i for some i). If p’ is a Q state, then the required computation is 
(p.0, p.1, ...) itself. Otherwise, p’ is not a Q state. But because S is also weakly 
stabilizing to Q, there is a computation (q.0, q.1, ...) that starts at p’ (i.e. p’ 
= q.0) and has a Q state q (i.e. q = q.j for some j). Because p’ is a P state and 
P is closed in S, every state in the computation (q.0, q.1, ...) is a P state. 
Thus, state q in the computation (q.0, q.1, ...) is a (P ∧ Q) state. In this case, 
the required computation is (p.0, ..., p.(i-1), p’, q.1, ..., q.j, ...). Thus, S is 
weakly stabilizing to (P ∧ Q). 

7 Composition of Weak Stabilization 

In this section, we describe a method for composing several weakly stabiliz- 
ing systems into a single weakly stabilizing system. Note that although there 
are methods for composing several stabilizing systems into a single stabilizing 
system, the methods for composing weakly stabilizing systems seem richer than 
those for composing stabilizing systems. In particular, the method for composing 
weakly stabilizing systems described in this section cannot be used to compose 
stabilizing systems. 

A state predicate P of a system S is called a fixed point in S iff every action 
“g -> v := E” of S is such that g = false or v = E at every P state. 

Let v and w be two variables of a system S. Variable v is an input of S iff v 
does not occur in the left-hand side of any assignment (in an action) in S. Vari- 
able w is an output of S iff w does not occur in the guard or in the right-hand 
side of any assignment (in an action) in S. We adopt the notation S[v, w] to 
denote a system S that has an input variable v and an output variable w. 

Two systems S[v, w] and T[w, v] are compatible iff they have no common 
variables other than v and w. 

Two compatible systems S[v, w] and T[w, v] can be composed into a single 
system, denoted S[v, w] || T[w, v], as follows. First, the set of variables of the 
composed system is the union of the two sets of variables of systems S[v, w] and 
T[w, v]. Second, the set of actions of the composed system is the union of the 
two sets of actions of systems S[v, w] and T[w, v]. 

Theorem 12. 

If S[v, w] is weakly stabilizing to a fixed point P, 

T[w, v] is weakly stabilizing to a fixed point Q, 

S[v, w] and T[w, v] are compatible, 

there is a one-to-one correspondence between the values of v and 
the values of w such that 

P holds only at the corresponding value pairs, and 
Q holds only at the corresponding value pairs, 
then S[v, w] || T[w, v] is weakly stabilizing to (P ∧ Q). 

Proof. Because P is a fixed point in system S, each action “g -> v := E” of S is 
such that g = false or v = E at every P state. Also, because Q is a fixed point 
in system T, each action “g -> v := E” of T is such that g = false or v = E at 
every Q state. Thus, each action “g -> v := E” of system S || T is such that g = 
false or v = E at each (P ∧ Q) state, and the state predicate (P ∧ Q) is a fixed 
point in the system S || T. It remains to show that for every state b of system S 
|| T, there is a computation of system S || T that starts at state b and has a (P 
∧ Q) state. 

Let b be any state of system S || T, and let b|S be the “projection” of state b 
on system S. Because system S is weakly stabilizing to P, there is a computation 
x of S that starts at state b|S and has a P state. Thus, there is a computation 
(b.0, b.1, ...) of system S || T that starts at state b and has the same sequence 
of actions executed in computation x. Computation (b.0, b.1, ...) has a state 
c, i.e. b.i = c for some i, such that c|S is a P state that occurs in computation 
x. Because c|S is a P state, the values of the two variables v and w at state c 
constitute a corresponding value pair. 

Because system T is weakly stabilizing to Q, there is a computation y of T 
that starts at state c|T and has a Q state. Thus, there is a computation (c.0, c.1, 
...) of system S || T that starts at state c and has the same sequence of actions 
executed in computation y. Computation (c.0, c.1, ...) has a state c.j such that 
state c.j|T is a Q state that occurs in computation y. Because c.j|T is a Q state, 
the values of the two variables v and w at state c.j constitute a corresponding 
value pair. Thus, the values of variables v and w at state c.j are the same as 
their values at state c. Therefore, c.j is a (P ∧ Q) state. In other words, 
the computation (b.0, b.1, ..., b.i, c.1, ..., c.j, ...) starts at state b and has a 
(P ∧ Q) state. This completes our proof that S || T is weakly stabilizing to (P 
∧ Q). 

8 Concluding Remarks 

In this paper, we have introduced the property of weak stabilization and showed 
that this property is superior to the (strictly stronger) property of stabilization in 
many respects. First, we showed that the proof obligations for weak stabilization 
are less severe than those for stabilization. This suggests that verifying weak sta- 
bilization is easier than verifying stabilization. Second, we showed that adding 
delays to a system preserves the weak stabilization properties of that system 
but does not necessarily preserve the stabilization properties of the system. This 
suggests that weakly stabilizing systems are easier to implement than stabilizing 
systems. Third, we showed that weakly stabilizing systems that have a finite 
number of states are in fact stabilizing under strong fairness, and showed that 
weak stabilization satisfies several interesting theorems that are also satisfied by 
stabilization. This suggests that weak stabilization is a “good approximation” 
of stabilization. Fourth, we described a method for combining weakly stabiliz- 
ing systems and argued that this method cannot be used to compose stabilizing 
systems. This suggests that weakly stabilizing systems are easier to compose than 
stabilizing systems. 
Abstract. We present a formal specification of the PING protocol, and 
use three concepts of convergence theory, namely closure, convergence, 
and protection, to show that this protocol is secure against weak adver- 
saries (and insecure against strong ones). We then argue that despite the 
security of PING against weak adversaries, the natural vulnerability of 
this protocol (or of any other protocol for that matter) can be exploited 
by a weak adversary to launch a denial of service attack against any 
computer that hosts the protocol. Finally, we discuss three mechanisms, 
namely ingress filtering, hop integrity, and soft firewalls that can be used 
to prevent denial of service attacks in the Internet. 

1 Introduction 

Recent intrusion attacks on the Internet, the so-called denial of service attacks, 
have been launched through the well-known PING protocol [3]. These repeated attacks 
raise the following two important questions: Is the PING protocol secure? Can 
the PING protocol be made secure enough to prevent denial of service attacks? 
In this paper, we use several concepts of convergence theory to answer these two 
important questions. In particular, we use the three concepts of closure, con- 
vergence, and protection to show that the PING protocol is in fact secure. We 
also argue that the PING protocol cannot be secure enough to prevent denial 
of service attacks. An adversary can always exploit the natural vulnerability of 
any (possibly secure) protocol to launch a denial of service attack against any 
computer that hosts this protocol. We briefly discuss several techniques that 
can be used to safeguard the computers in any network against denial of service 
attacks on that network. 

The PING protocol in this paper is specified using a version of the Abstract 
Protocol Notation presented in p| . Using this notation, each process in a proto- 
col is defined by a set of constants, a set of variables, a set of parameters, and a 
set of actions. For example, in a protocol consisting of two processes p and q and 
two channels (one from p to q and one from q to p), process p can be defined as 
follows. 
process p 
const  <name of constant>  : <type of constant> 
       ... 
       <name of constant>  : <type of constant> 
var    <name of variable>  : <type of variable> 
       ... 
       <name of variable>  : <type of variable> 
par    <name of parameter> : <type of parameter> 
       ... 
       <name of parameter> : <type of parameter> 
begin 
       <action> 
   []  <action> 
       ... 
   []  <action> 
end 

The constants of process p have fixed values. The variables of process p can 
be read and updated by the actions of process p. Comments can be added any- 
where in a process definition; each comment is placed between the two brackets 
{ and }. 

Each <action> of process p is of the form: 

<guard> -> <statement> 

The guard of an action of p has one of the following three forms: a boolean 
expression over the constants and variables of p, a receive guard of the form 
rcv <message> from q, or a timeout guard that contains a boolean expression 
over the constants and variables of every process and the contents of the two 
channels in the protocol. 

Each parameter declared in a process is used to write a finite set of actions 
as one action, with one action for each possible value of the parameter. For 
example, if process p has the following variable x and parameter i, 

    var x : 0 .. n-1 
    par i : 0 .. n-1 

then the following action in process p 

    x = i -> x := x + i 

is a shorthand notation for the following n actions. 

       x = 0   -> x := x + 0 
    [] ... 
    [] x = n-1 -> x := x + n-1 
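
The parameter shorthand amounts to generating one action per parameter value. A minimal sketch of that expansion follows, assuming an action is represented as a hypothetical pair of Python functions (guard, statement) over the value of x. The arithmetic is left exactly as in the example above, without wrapping the result back into the domain 0 .. n-1.

```python
def expand(n):
    """Expand the parameterized action  x = i -> x := x + i  for i in 0 .. n-1."""
    actions = []
    for i in range(n):
        # Bind i as a default argument so each action keeps its own value.
        guard = lambda x, i=i: x == i
        statement = lambda x, i=i: x + i
        actions.append((guard, statement))
    return actions

actions = expand(4)
enabled = [i for i, (guard, _) in enumerate(actions) if guard(2)]
print(enabled)           # [2]
print(actions[2][1](2))  # 4
```

At any value of x exactly one of the n expanded actions is enabled, namely the one whose parameter value equals x.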

Executing an action consists of executing the statement of this action. Exe- 
cuting the actions of different processes in a protocol proceeds according to the 
following three rules. First, an action is executed only when its guard is true. 
Second, the actions in a protocol are executed one at a time. Third, an action 
whose guard is continuously true is eventually executed. 

The <statement> of an action of process p is a sequence of <skip>, <send>, 
<assignment>, <selection>, or <iteration> statements of the following forms: 

<skip>       : skip 
<send>       : send <message> to q 
<assignment> : <variable in p> := <expression> 
<selection>  : if <boolean expression> -> <statement> 
               ... 
               [] <boolean expression> -> <statement> 
               fi 
<iteration>  : do <boolean expression> -> <statement> 
               od 

Executing an action of process p can cause a message to be sent to process 
q. There are two channels between the two processes: one is from p to q, and 
the other is from q to p. Each sent message from p to q remains in the channel 
from p to q until it is eventually received by process q or is lost. Messages that 
reside simultaneously in a channel form a set and so they are received or lost, 
one at a time, in any order and not necessarily in the same order in which they 
were sent. 



2 The PING Protocol 

The PING protocol (which stands for the Packet Internet Groper protocol) al- 
lows a computer in the Internet to test whether a specified computer in the 
Internet is up |0|. The test is carried out as follows. First, the testing computer 
p sends several echo request messages to the computer q[i] being tested. Second, 
the testing computer p waits to receive one or more echo reply messages from 
computer q[i]. Third, if the testing computer p receives one or more echo reply 
messages from computer q[i], p concludes that computer q[i] is up. On the other 
hand, if the testing computer p receives no echo reply message from computer 
q[i], p concludes that q[i] may not be up. The conclusion in this case is not cer- 
tain because it is possible (though unlikely) that all the echo request messages 
sent from p to q[i] or the corresponding echo reply messages from q[i] to p are 
lost during transmission. 

The testing computer p stores the test results in a local variable array named 
up that is declared as follows. 

    var up : array [0 .. n-1] of boolean 

Note that n is the number of computers being tested. If at the end of a ses- 
sion of the PING protocol, up[i] = true in computer p, then computer q[i] was 
up sometime during that session. On the other hand, if at the end of a session, 
up[i] = false in computer p, then no firm conclusions can be reached (but it is 
likely that q[i] was down during that session). 

For computer p to ensure that a received echo reply message from a computer 
q[i] corresponds to the echo request message that p has sent earlier to q[i], p adds 
a random identifier id[i] to all the echo request messages that p sends to q[i] in 
a session of the PING protocol. When a computer q[i] receives an erqst(id[i]) 
message from computer p, q[i] replies by sending an erply(id[i]) message to p. 
When computer p receives an erply(id[i]) message, p checks whether id[i] is the 
random identifier of the current protocol session with q[i]. If so, p accepts the 
message and assigns the corresponding up[i] element the value true. Otherwise, 
p discards the message. 

The process of the testing computer p in the PING protocol is defined as 
follows. 

process p 
const n, idmax, cmax 
var   up   : array [0 .. n-1] of boolean 
      wait : array [0 .. n-1] of boolean 
      id   : array [0 .. n-1] of 0 .. idmax 
      x    : 0 .. idmax 
      c    : 0 .. cmax 
par   i    : 0 .. n-1 
begin 
   ~wait[i] -> up[i] := false; 
               id[i] := random; 
               c := 0; 
               do (c < cmax) -> 
                  send erqst(id[i]) to q[i]; 
                  c := c + 1; 
               od; 
               wait[i] := true 

[] rcv erply(x) from q[i] -> 
               if wait[i] ∧ x = id[i] -> 
                  up[i] := true 
               [] ~wait[i] ∨ x <> id[i] -> 
                  {discard erply} skip 
               fi 

[] timeout ( wait[i] ∧ 
             erqst(id[i])#ch.p.q[i] + erply(id[i])#ch.q[i].p = 0 ) -> 
               wait[i] := false 
end 



Process p has three actions. In the first action, p recognizes that it is no 
longer waiting for any erply messages from its last session of the protocol with 
q[i], and starts its next session with q[i]. Process p starts the next session with 
q[i] by selecting a new random identifier id[i] for the new session and sending 
cmax erqst(id[i]) messages to process q[i]. In the second action, process
p receives an erply(x) message from a process q[i] and decides whether to
accept the message (when p is waiting and x = id[i]) and assign up[i] the value
true, or discard the message. In the
third action, process p recognizes that a long time has passed since p has sent 
the erqst(id[i]) messages in the current session, and so the number of erqst(id[i]) 
messages in the channel from p to q[i], denoted erqst(id[i])#ch.p.q[i], is zero 
and the number of erply(id[i]) messages in the channel from q[i] to p, denoted 
erply(id[i])#ch.q[i].p, is also zero. In this case, p terminates the current session
with q[i] by assigning variable wait[i] the value false. 
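
As an illustration only, the three actions of process p can be sketched in Python. The class name `Pinger`, the `send` callback, and the default values of `idmax` and `cmax` are assumptions made for this sketch and do not appear in the protocol definition. The timeout action is modeled as an explicit call, on the assumption that the caller invokes it only after every message of the current session has left the two channels.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Pinger:
    """Illustrative model of process p for n tested computers."""
    n: int
    idmax: int = 2**16 - 1   # assumed identifier range
    cmax: int = 3            # assumed number of requests per session
    up: list = field(default_factory=list)
    wait: list = field(default_factory=list)
    ids: list = field(default_factory=list)

    def __post_init__(self):
        self.up = [False] * self.n
        self.wait = [False] * self.n
        self.ids = [0] * self.n

    def start_session(self, i, send):
        """First action: enabled only when p is not waiting on q[i]."""
        if self.wait[i]:
            return False
        self.up[i] = False
        self.ids[i] = random.randint(0, self.idmax)
        for _ in range(self.cmax):
            send(i, ("erqst", self.ids[i]))
        self.wait[i] = True
        return True

    def on_reply(self, i, x):
        """Second action: accept erply(x) only for the current session."""
        if self.wait[i] and x == self.ids[i]:
            self.up[i] = True
            return True
        return False  # the reply is discarded

    def on_timeout(self, i):
        """Third action: end the session (assumes both channels hold no
        message carrying the current identifier)."""
        if self.wait[i]:
            self.wait[i] = False
```

A reply is accepted only while `wait[i]` is true and its identifier matches `ids[i]`; after `on_timeout(i)` even a reply with the right identifier is discarded.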

The process for any computer q[i] being tested is defined as follows. 

process q[i: 0 .. n-1]

const n, idmax
input up : boolean
var   x  : 0 .. idmax

begin

    rcv erqst(x) from p ->
        if  up -> send erply(x) to p
        [] ~up -> skip
        fi

end

Process q[i] has a boolean input named up that describes the current state of 
q[i]. Clearly, the value of input up can change over time to reflect the change in 
the state of q[i]. Nevertheless, to keep our analysis of the PING protocol simple, 
we assume that the value of input up remains constant. 

Process q[i] has only one action. In this action, q[i] receives an erqst(x) mes- 
sage from p and either sends an erply(x) message to p (if input up in q[i] is true), 
or discards the received erqst(x) message (if input up in q[i] is false). 
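
The single action of q[i] is small enough to write as one function. The following is a sketch under the assumption that a message is represented as a `(kind, identifier)` tuple; the function name `q_receive` is hypothetical.

```python
def q_receive(up, message):
    """Single action of q[i]: echo the identifier back only when q[i] is up.

    Returns the erply message to send to p, or None when the request is
    silently discarded.
    """
    kind, x = message
    if kind != "erqst":
        raise ValueError("q[i] only handles erqst messages")
    return ("erply", x) if up else None
```

Note that q[i] never inspects the identifier x; it only copies x into the reply, which is why an inserted erqst(x) message yields an erply(x) message with the same identifier.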




Fig. 1. State transition diagram of PING. 



The state transition diagram for the PING protocol is shown in Figure 1.
There are three nodes in this diagram. Each of the three nodes represents a set
of states of the PING protocol. Each node v is labeled with a state predicate
S.v.i, whose value is true at every state represented by node v. The three state
predicates in the state transition diagram, namely S.0.i, S.1.i, and S.2.i, are
defined as follows.

S.0.i = ~wait[i] ∧ B.i = 0 ∧ C.i = 0

S.1.i = wait[i] ∧ B.i > 0 ∧ C.i = 0 ∧ X.i ∧ Y.i

S.2.i = wait[i] ∧ B.i = 0 ∧ C.i = 0 ∧ Y.i

where

B.i = erqst(id[i])#ch.p.q[i] + erply(id[i])#ch.q[i].p

    B.i is the number of messages (in the two channels between p
    and q[i]) whose identifiers are equal to the value of variable id[i]
    in p.

C.i = Σ_{r ≠ id[i]} ( erqst(r)#ch.p.q[i] + erply(r)#ch.q[i].p )

    C.i is the number of messages (in the two channels between p
    and q[i]) whose identifiers are different from the value of variable
    id[i] in p.

X.i = ( (erply(id[i])#ch.q[i].p > 0) ⇒ (up = true in q[i]) )

    X.i states that if there is one or more erply(id[i]) message in the
    channel from a process q[i] to process p, then input up in q[i]
    has the value true.

Y.i = ( (up[i] = true in p) ⇒ (up = true in q[i]) )

    Y.i states that if an element up[i] in process p has the value
    true, then input up in process q[i] has the value true.
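
To make the predicates concrete, the following sketch evaluates B.i, C.i, X.i, and Y.i on an explicit state and reports which node of the diagram (if any) the state belongs to. The representation is an assumption of this sketch: each channel is a list holding the identifiers of the messages currently in it, and the function name `classify` is hypothetical.

```python
def classify(wait_i, up_i, q_up, id_i, ch_pq, ch_qp):
    """Return "S.0", "S.1", "S.2", or None for a state of p and q[i].

    ch_pq: identifiers of the erqst messages in the channel from p to q[i]
    ch_qp: identifiers of the erply messages in the channel from q[i] to p
    """
    B = ch_pq.count(id_i) + ch_qp.count(id_i)   # current-session messages
    C = len(ch_pq) + len(ch_qp) - B             # all other messages
    X = (ch_qp.count(id_i) == 0) or q_up        # reply in transit => q[i] up
    Y = (not up_i) or q_up                      # up[i] in p => q[i] up
    if C != 0:
        return None
    if not wait_i and B == 0:
        return "S.0"
    if wait_i and B > 0 and X and Y:
        return "S.1"
    if wait_i and B == 0 and Y:
        return "S.2"
    return None
```

For example, a waiting p with two requests and one reply of the current session in transit, and q[i] up, is classified as S.1; the same state with one foreign identifier in a channel has C.i > 0 and is therefore not a safe state.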

The directed edges in the state transition diagram in Figure 1 represent
executions of actions in processes p and q[i]. The directed edge from node S.0.i to
node S.1.i represents an execution of the first action in process p. The directed
edge from node S.1.i to node S.2.i represents an execution of the second action in
process p. The directed edge from node S.2.i to node S.0.i represents execution
of the third action in process p. The self-loop at node S.1.i represents any of the
following: an execution of the action in process q[i], an execution of the second 
action in process p, and a loss of one message from one of the two channels 
between p and q[i]. 



3 The PING Adversary 

For two reasons, the PING protocol is designed to overcome the activities of a 
weak adversary, rather than a strong adversary. First, this assumption keeps the 
PING protocol, which performs a basic task (of allowing any computer to test 
whether another computer in the Internet is up) both simple and efficient. Sec- 
ond, by disrupting the PING protocol, a strong adversary achieves very little: 
merely convincing one computer that another computer in the Internet is up 
when in fact that other computer is not up. It is not clear what a strong
adversary gains by such disruption, and so it is doubtful that a strong adversary
will attempt to use its strength to disrupt the PING protocol.

The weak adversary considered in designing the PING protocol is one that 
can insert a finite number of erqst(x) messages into the channel from process p 
to a process q[i], and can insert a finite number of erply(x) messages into the 
channel from a process q[i] to process p. Identifiers of the messages inserted by 
this adversary at any instant are different from the identifiers of the current
session and any future session of the PING protocol. Thus, if the adversary
inserts an erply(x) message in the channel from process q[i] to process p at some
time instant, and if p receives this message at some future instant, then p can
still detect that x is different from the current value of variable id[i] and discard
the message. For convenience, we refer to every message whose identifier is dif- 
ferent from the identifiers of the current session and every future session as an 
adversary message. 

Figure 2 shows the state transition diagram that describes the activities of 
both the PING protocol and its (weak) adversary. Note that this diagram has 
three additional nodes over the diagram in Figure 1. These three nodes are
labeled with the state predicates U.0.i, U.1.i, and U.2.i defined as follows.

U.0.i = ~wait[i] ∧ B.i = 0 ∧ C.i > 0

U.1.i = wait[i] ∧ B.i > 0 ∧ C.i > 0 ∧ X.i ∧ Y.i

U.2.i = wait[i] ∧ B.i = 0 ∧ C.i > 0 ∧ Y.i







Fig. 2. State transition diagram of PING and adversary. 



Note that B.i, C.i, X.i, and Y.i are defined above in Section 2. Note also
that each predicate U.v.i is the same as the corresponding predicate S.v.i except
that the conjunct C.i = 0 in S.v.i is replaced by the conjunct C.i > 0 in U.v.i.
Thus, each U.v.i state is the same as a corresponding S.v.i state except that
some adversary messages are inserted into some channels in the protocol.

In the state transition diagram in Figure 2, each edge labeled “Adv” represents
an adversary action where one or more adversary messages are inserted
into some channels in the protocol. Each edge or self-loop labeled “u” represents 
an execution of some protocol action where an adversary message is either re- 
ceived by a process q[i] (and another adversary message is sent from q[i] to p) 
or received by process p (and discarded). 

Note that, despite the adversary involvement, the state predicate Y.i holds
at every S.2.i state and every U.2.i state. Thus, Y.i holds at the end of every 
session between process p and process q[i] of the PING protocol. 

4 Security of PING 

In this section, we use three concepts of the theory of convergence [?], namely
closure, convergence, and protection, to show that the PING protocol (presented 
in Section 2) is secure against the weak adversary (presented in Section 3). In 
general, to show that a protocol P is secure against an adversary D, one needs 
to partition the reachable states of P into safe states and unsafe states, then 
identify the critical variables of P (those that need to be protected from the 
actions of D), and show that the following three conditions hold ([?] and [?]).

i. Closure: 

The set of safe states is closed under any execution of a P action, and the set 
of reachable states (i.e. the union of the safe state set and the unsafe state 
set) is closed under any execution of a P action or a D action. 

ii. Convergence: 

Starting from any unsafe state, any infinite execution of the P actions leads 
P to safe states. 

iii. Protection: 

If an execution of a P action starting at an unsafe state s changes the values
of the critical variables of P from V to V’, then there is a safe state s’ such
that the values of the critical variables in s’ equal V, and execution of the
same action starting at s’ changes the values of the critical variables of P from
V to V’. (Note that this condition is a generalization of the corresponding
condition in [?], which states that each execution of a P action starting at an
unsafe state cannot change the values of the critical variables of P.)

Following this definition, the security of the PING protocol can be estab- 
lished by identifying the safe, unsafe, and reachable states of the protocol, then 
identifying its critical variables, and finally showing that the PING protocol 
satisfies the three conditions of closure, convergence, and protection. 

The safe states of the PING protocol are specified by the state predicate S.i, 
where 

S.i = S.0.i ∨ S.1.i ∨ S.2.i

The unsafe states of PING are specified by the state predicate U.i, where

U.i = U.0.i ∨ U.1.i ∨ U.2.i

Thus, the reachable states of the protocol are specified by the state predicate 
S.i V U.i. 

The PING protocol has only one critical variable, namely array up in process 
p. It remains now to show that the protocol satisfies the above three conditions 
of closure, convergence, and protection. 

Satisfying the Closure Condition: From the state transition diagram in
Figure 1, the set of safe states is closed under any execution of an action of
the PING protocol. From the state transition diagram in Figure 2, the set of
reachable states is closed under any execution of an action of the PING protocol 
or an action of the weak adversary. 



Satisfying the Convergence Condition: Along any infinite execution of the 
actions of the PING protocol, the following three conditions hold. 

i. No adversary message is added to the channel from process p to a process
q[i].

ii. Each adversary message in a channel from process p to a process q[i] is 
eventually discarded (if up = false in q[i]), or replaced by another adversary 
message in the channel from q[i] to p (if up = true in q[i]). 

iii. Each adversary message in a channel from a process q[i] to process p is 
eventually discarded. 

Thus, starting from an unsafe U.i state, any infinite execution of the actions 
of the PING protocol leads the protocol to a safe S.i state where no channel has 
adversary messages. 
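
The convergence argument can be illustrated with a small simulation that runs only the receive actions of q[i] and p until both channels are empty. This is a sketch under simplifying assumptions: channels are FIFO queues of identifiers, no message is lost, and the scheduler always serves the channel from p to q[i] first. The function name `drain` is hypothetical.

```python
from collections import deque


def drain(ch_pq, ch_qp, q_up, current_id):
    """Execute receive actions until no message remains in either channel.

    Returns (steps, accepted): the number of actions executed and the number
    of replies that p accepted as belonging to the current session.
    """
    ch_pq, ch_qp = deque(ch_pq), deque(ch_qp)
    steps = accepted = 0
    while ch_pq or ch_qp:
        if ch_pq:
            x = ch_pq.popleft()          # q[i] receives erqst(x)
            if q_up:
                ch_qp.append(x)          # ... and replaces it by erply(x)
        else:
            x = ch_qp.popleft()          # p receives erply(x)
            if x == current_id:
                accepted += 1            # only current-session replies count
        steps += 1
    return steps, accepted
```

With adversary identifiers 7, 8, 9 and current identifier 5, `drain([7, 8], [9], True, 5)` terminates after five actions without p accepting any reply: each request costs two actions (one at q[i], one at p) and each reply costs one.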



Satisfying the Protection Condition: Assume that an execution of an ac- 
tion of the PING protocol starting at an unsafe state s changes the value of array 
up in process p. Then, the executed action is one where process p receives an 
erply(x) message from a process q[i], where x = id[i]. Thus, the received erply(x)
message is not an adversary message, and receiving this message causes the value
of element up[i] in p to change from false to true. Let s’ be the state that results
from removing all the adversary messages that exist in state s. From the state
transition diagram in Figure 2, state s’ is a safe state. At state s’, message
erply(x) is still in the channel from process q[i] to process p. Thus, executing the
action where process p receives the erply(x) message, starting at state s’, causes 
the value of element up[i] in p to change from false to true. This completes our 
proof of the security of the PING protocol against the weak adversary. 

It is also straightforward to show that the PING protocol is not secure against 
a strong adversary that can insert messages whose identifiers are equal to the 
identifier of the current session. Consider an unsafe protocol state s where the
adversary has inserted an erply(x) message, where x = id[i], into the channel from
a process q[i], whose input up is false, to process p. Executing the action where
process p receives the inserted erply(x) message, starting at the unsafe state s,
changes the value of element up[i] in p from false to true (even though the value
of input up in q[i] is false). Because no action execution starting at any safe
state changes the value of element up[i] in p from false to true (given that the
value of input up in q[i] is false), the protection condition does not hold.
This argument shows that PING is not secure against a strong adversary. 
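
The difference between the two adversaries can be seen on the receive action alone. In the sketch below, `receive_reply` returns the new value of up[i], and `strip_adversary` builds the channel contents of the state s’ used in the protection argument. Both function names and the list-of-identifiers channel model are assumptions of this sketch.

```python
def receive_reply(wait_i, id_i, up_i, x):
    """New value of up[i] after p receives erply(x) from q[i]."""
    return True if (wait_i and x == id_i) else up_i


def strip_adversary(channel, id_i):
    """Channel contents in state s': adversary messages removed."""
    return [x for x in channel if x == id_i]


# Weak adversary: identifiers 9 and 12 differ from the session identifier 5.
channel_s = [9, 5, 12]
changing = [x for x in channel_s if receive_reply(True, 5, False, x)]
# Every message that changes up[i] in s is still present in s'.
assert changing == strip_adversary(channel_s, 5)

# Strong adversary: a forged erply(5) sets up[i] even if q[i] is down.
assert receive_reply(True, 5, False, 5) is True
```

The first assertion mirrors the protection condition: the only messages that change up[i] survive the removal of adversary messages. The second shows why a forged identifier defeats the protocol, since p cannot tell the forged reply from a genuine one.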

5 Vulnerability of PING

Security of the PING protocol against the weak adversary is established in the 
last section by showing that the protocol satisfies the three conditions of closure, 
convergence, and protection. The closure condition states that the unsafe states 
(specified by U.i) are the furthest that the adversary can lead the protocol away 
from its safe states (specified by S.i). The convergence condition states that, 
when the adversary stops inserting adversary messages into the protocol chan- 
nels, the PING protocol eventually converges from its current unsafe state to 
the safe states. The protection condition states that while the protocol is in its 
unsafe states (due to the influence of the adversary), the critical array “up” in 
process p is updated as if the protocol is in a safe state. 

Despite the security of PING against its weak adversary, the weak adversary 
can exploit the natural vulnerability of the PING protocol to attack any com- 
puter that hosts PING as follows. The adversary inserts a very large number of 
adversary messages into the protocol channels. The protocol processes p and
q[i: 0..n-1] become very busy processing and eventually discarding these messages.
Thus, the computers that host these processes become very busy and unable to 
perform any other service. Such an attack is usually referred to as a denial of 
service attack. 

It follows that denial of service attacks by weak adversaries can succeed by 
exploiting the natural vulnerability of any (even the most secure) protocols. The 
only way to prevent denial of service attacks is to prevent the adversary messages 
from reaching the protocol processes. In other words, the adversary messages 
need to be detected as such and discarded before they reach their destination 
processes. The question now is how to detect the adversary messages.

The answer to this question, in the case of the PING protocol, is straight- 
forward. In the known denial of service attacks that exploit the PING protocol, 
the adversary inserts messages whose source addresses are wrong. Because the 
source of the inserted messages is the adversary itself, the source address in each 
of these messages should have been the address of the adversary. However, the 
source address in each inserted erqst(x) message is recorded to be the address of 
p so that the reply to the message is sent to p. Also, the source address in each 
inserted erply(x) message is recorded to be the address of some q[i] in order to 
hide the identity of the adversary. 

6 Preventing Denial of Service Attacks 

To prevent denial of service attacks, the routers in the Internet need to be modi- 
fied to perform the following task: detect and discard any message whose source 
address is wrong. Note that this task can prevent any adversary from exploiting 
the natural vulnerability of any protocol, not only PING, to launch a denial 
of service attack against any host of that protocol. This task can be achieved 
using two complementary mechanisms named “Ingress Filtering” [?] and “Hop
Integrity” [?]. These two complementary mechanisms can be described as follows:

i. Ingress Filtering:

A router that receives a message, supposedly from an adjacent host H, for- 
wards the message only if the source address recorded in the message is that 
of H. 

ii. Hop Integrity: 

A router that receives a message, supposedly from an adjacent router R, 
forwards the message only after it checks that the message was indeed sent 
by R. 

These two mechanisms can be used together to detect and discard any mes- 
sage, whose source address is wrong, that is inserted by a weak or strong adver- 
sary into the Internet. Thus, a large percentage of denial of service attacks can 
be prevented. 
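
The two router checks can be sketched as follows. The ingress check follows the description above directly. For hop integrity, the text only requires that a router verify that a message was indeed sent by the adjacent router R; the use of an HMAC over a key shared by the two routers is an assumption of this sketch, as are all function and parameter names.

```python
import hashlib
import hmac


def ingress_ok(attached_hosts, port, packet):
    """Forward only if the source address is that of the adjacent host.

    attached_hosts maps a router port to the address of the host on it.
    """
    return attached_hosts.get(port) == packet["src"]


def hop_tag(shared_key, payload):
    """Tag that router R attaches to a message (assumed HMAC-SHA256)."""
    return hmac.new(shared_key, payload, hashlib.sha256).digest()


def hop_ok(shared_key, payload, tag):
    """Forward only if the tag proves the message came from router R."""
    return hmac.compare_digest(hop_tag(shared_key, payload), tag)
```

For instance, with `attached_hosts = {1: "10.0.0.7"}`, a packet arriving on port 1 with source address "10.0.0.9" fails `ingress_ok` and is dropped at the first router, before it can reach any PING process.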

Another mechanism for detecting and discarding adversary messages is called 
soft firewalls. A soft firewall for a process p is another process fp that satisfies 
the following three conditions: 

i. Output Observation: 

Each message that process p intends for another process q is first sent to the 
firewall process fp before it is forwarded to process q. 

ii. Input Observation: 

Each message from another process q intended for process p is first sent to 
the firewall process fp before it is forwarded to process p. 

iii. Input Filtering: 

The firewall process fp maintains a coarse image of the local state of pro- 
cess p, and uses this image to detect and discard any inappropriate message 
intended for p from any other process or from any adversary. 

(Note that the soft firewall processes described here are similar to stateless 
firewall processes described in [2], with one exception. A stateless firewall does
not maintain any image of the local state of the process behind the firewall, 
whereas a soft firewall process maintains a soft state image of the local state of 
the process behind the firewall.) 

A possible soft firewall for process p in the PING protocol is a process fp that 
maintains one bit “w” as a coarse state for array “wait” in process p. When- 
ever fp receives an erqst(x) message from process p intended for process q[i], fp 
assigns its bit w the value 1. Process fp keeps bit w at 1 for one
minute after it receives the last erqst(x) message from p, and then assigns bit w
the value 0. (One minute is ample time for the sent erqst(x) message
to reach the intended q[i] and for the resulting erply(x) to return from q[i] to p.)
Whenever the firewall process fp receives an erply(x) message intended for p, fp 
checks the current value of bit w and forwards the received erply(x) message to 
process p only if the value of bit w is “1”. Thus, all erply(x) messages, that are 
generated by the weak adversary, are discarded by fp before they reach process 
p (as long as p itself does not send erqst(x) messages to any q[i]). 
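
A minimal sketch of such a soft firewall is given below. The class name `SoftFirewall`, the `(kind, identifier)` message tuples, and the explicit `now` argument (seconds, supplied by the caller so that the example needs no real clock) are assumptions of this sketch. Bit w is not stored; it is derived from the time of the last observed request.

```python
class SoftFirewall:
    """Firewall process fp keeping one soft-state bit w for process p."""

    def __init__(self, hold_seconds=60.0):
        self.hold = hold_seconds      # the "one minute" of the description
        self.last_request = None      # time of the last erqst seen from p

    def w(self, now):
        """Bit w: 1 while a request was observed within the hold period."""
        return (self.last_request is not None
                and now - self.last_request < self.hold)

    def on_outgoing(self, message, now):
        """Output observation: record requests, forward everything."""
        if message[0] == "erqst":
            self.last_request = now
        return True

    def on_incoming(self, message, now):
        """Input filtering: forward an erply only while w is 1."""
        if message[0] == "erply":
            return self.w(now)
        return True
```

Because the image of p is coarse, fp forwards every reply that arrives while w is 1, including adversary replies; process p itself still discards those by checking the identifier.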






7 Concluding Remarks 

Our objective in this paper is three-fold. First, we want to demonstrate the
utility of a new definition of system security [?] that is based on the three
concepts of closure, convergence, and protection of convergence theory. Our
demonstration shows that the concept of protection as presented in [?] is too
strong, and suggests a sensible weakening of this concept, discussed in Section 4.
Second, we want to formally show that the PING protocol is secure (against a
weak adversary), despite the fact that this protocol has been used repeatedly to
launch denial of service attacks against several computers in the Internet. Our
proof, based on the two state transition diagrams in Figures 1 and 2, is both
simple and straightforward. Third, we want to make the point that every protocol
(whether secure or insecure) has a natural vulnerability that can be exploited by
(possibly weak) adversaries to attack any computer that hosts this protocol. Such
attacks can be foiled, not by making protocols more secure, which is impossible,
but by detecting the attacking messages early on and discarding them promptly.
In this regard, the ideas of ingress filtering, hop integrity, and soft firewalls
offer much hope.


Abstract. In this paper, we present the first self-stabilizing protocols
for the $\ell$-exclusion problem in the message passing model. The
$\ell$-exclusion problem is a generalization of the mutual exclusion problem:
we allow $\ell$ ($\ell \geq 1$) processors, instead of 1, to use a shared
resource. We propose a new technique for the design of self-stabilizing
$\ell$-exclusion: the controller. This tool makes it possible to count the
tokens of the system without any counter variable for all processors except
one called Root. We also introduce a new protocol composition called
parametric composition. Then we present protocols on rings and on trees. The
space requirement of both algorithms is independent of $\ell$ for all
processors except Root. The stabilization time of the first protocol is $3n$
time, where $n$ is the ring size, and the stabilization time of the second one
is $6h + 2$ time, where $h$ is the tree height.



1 Introduction 

Fault-tolerance is one of the most important requirements of modern distributed
systems. Various types of faults are likely to occur at various parts of the system.
Distributed systems undergo transient faults because they are exposed to
constant changes in their environment. The concept of self-stabilization [?] is
the most general technique to design a system to tolerate arbitrary transient
faults. A self-stabilizing system, regardless of the initial states of the processors
and initial messages in the links, is guaranteed to converge to the intended
behavior in finite time. In 1974, Dijkstra introduced the property of
self-stabilization in distributed systems and applied it to algorithms for mutual
exclusion [?]. In the mutual exclusion problem, there is an activity, that of
executing a critical section of code, that only one process can do at a time. The
$\ell$-exclusion problem is a generalization of the mutual exclusion problem:
$\ell$ processors are now allowed to execute the critical section concurrently.
Algorithms for $\ell$-exclusion are given in [?]. The problem was first defined
and solved by Fischer, Lynch, Burns, and Borodin [?]. The first self-stabilizing
algorithm for the $\ell$-exclusion problem was presented in [?]. This solution is
a generalization of Dijkstra’s algorithm [?]. The algorithm in [?] is the second
self-stabilizing solution to the $\ell$-exclusion problem, but in the shared memory
model. In both cases the space requirement depends on the size of the network
and $\ell$. Algorithms in [?] and [?]
require at least 0(2") and Q{ntj states per process, respectively, where n is 
the size of the network. The first attempts to solve this problem with a space
complexity independent of $n$ (and almost independent of $\ell$) are presented in
[?] (on chains) and [?] (on trees). All those algorithms run in the state model
([?]) or in the shared memory model ([?]).

Contributions. In this paper, we present the first self-stabilizing protocols for
the $\ell$-exclusion problem in the message passing model. We propose a new
technique for the design of self-stabilizing $\ell$-exclusion: the controller. This
tool makes it possible to count the tokens of the system without any counter
variable for all processors except one called Root. We also introduce a new
protocol composition called parametric composition. The parametric composition
of protocols $P_1$ and $P_2$ allows interactions both from $P_1$ to $P_2$ and
from $P_2$ to $P_1$. This is a generalization of the collateral composition in [?]
and of the conditional composition in [?]. Using the controller on unidirectional
rings, a simple circulation of $\ell$ tokens, and the parametric composition, we
design a self-stabilizing $\ell$-exclusion protocol on rings. Using the controller
on trees, a simple distribution of $\ell$ tokens, and the parametric composition,
we then design a self-stabilizing $\ell$-exclusion protocol on trees. The space
requirement of both algorithms is independent of $\ell$ for all processors except
Root. The stabilization time of the first protocol is $3n$ time, where $n$ is
the ring size, and the stabilization time of the second one is $6h + 2$ time, where
$h$ is the tree height. Using the power of both the controller and the parametric
composition, we also discuss some adaptations of these protocols.

Outline of the Paper. In Section 2, we describe the distributed system, the
model we use in this paper, and also state the specification of the problem solved
in this paper. In Section 3, we present the parametric composition used to write
our algorithms. In Section 4, we present a self-stabilizing $\ell$-exclusion
protocol on rings. Then we present an implementation of this solution on tree
networks in Section 5. Finally, we make some concluding remarks in Section 6.

2 Preliminaries 

2.1 Model 

Distributed systems we consider in this paper are asynchronous networks. We
number the processors from 0 to $n-1$ for ease of notation only. We assume there is
a distinguished processor (processor 0) that we often refer to as Root. Processors
communicate with their neighbors by sending them messages. We assume that
the time of message transit is finite but not bounded, and as long as a message is
not processed, we consider that it is in transit. Moreover, each link is assumed
to be bounded; messages transmitted over the links are not lost and arrive error
free and in the order sent (FIFO) during and after the stabilization phase. We
consider semi-uniform protocols [?]. So, every processor with the same degree
executes the same protocol, except one processor, Root. The protocol consists of
a collection of actions. An action is of the form: < guard > → < statement >.
A guard is a boolean expression over the variables of the processor and/or an
input message. A statement is a sequence of assignments and/or message
sendings. An action can be executed only if its guard evaluates to true. When
several actions of a processor are simultaneously enabled, only the first one in the
text of the protocol is executed. The state of a processor is defined by the values
of its variables. The state of a system is a vector of $n+1$ components where the
first $n$ components represent the states of the $n$ processors, and the last one
refers to the multiset of messages in transit in the links. In the sequel, we refer
to the state of a processor and of the system as a (local) state and a configuration,
respectively. Let a distributed protocol $\mathcal{P}$ be a collection of binary
transition relations denoted by $\mapsto$ on $\mathcal{C}$, the set of all possible
configurations of the system. A computation of a protocol $\mathcal{P}$ is a
maximal sequence of configurations
$e = \gamma_0, \gamma_1, \ldots, \gamma_i, \gamma_{i+1}, \ldots$ such that for
$i \geq 0$, $\gamma_i \mapsto \gamma_{i+1}$ (a single computation step) if
$\gamma_{i+1}$ exists, or $\gamma_i$ is a terminal configuration.
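
The rule that only the first enabled action in the text is executed, and the notion of a maximal computation, can be sketched for a single processor as follows. This is a deliberately simplified illustration: it ignores messages and other processors, represents an action as a `(guard, statement)` pair of Python functions, and bounds the computation by `max_steps`; all of these are assumptions of the sketch.

```python
def step(state, actions):
    """Execute the first enabled action in textual order.

    Returns the next state, or None if no guard holds (terminal state).
    """
    for guard, statement in actions:
        if guard(state):
            return statement(state)
    return None


def computation(state, actions, max_steps=100):
    """Build a computation: it stops at a terminal state or at max_steps."""
    trace = [state]
    for _ in range(max_steps):
        nxt = step(trace[-1], actions)
        if nxt is None:
            break
        trace.append(nxt)
    return trace


# Two actions over an integer state; both are enabled at 12, so only the
# first one (subtract 10) is executed there.
actions = [
    (lambda s: s > 10, lambda s: s - 10),
    (lambda s: s > 0, lambda s: s - 1),
]
```

Starting from state 12, the computation is 12, 2, 1, 0: the first action preempts the second while both are enabled, and 0 is terminal because no guard holds.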

2.2 Self-Stabilization

Definition 1 (Self-Stabilization). A protocol $\mathcal{P}$ is self-stabilizing for
a specification $\mathcal{SP}$ (a predicate over the computations) if and only if
every execution starting from an arbitrary configuration will eventually reach
(convergence property) a configuration from which it satisfies $\mathcal{SP}$
forever (closure property).

In practice, we associate with $\mathcal{P}$ a predicate on the system
configurations, denoted $\mathcal{L}_{\mathcal{P}}$ and called the legitimacy
predicate. We define $\mathcal{L}_{\mathcal{P}}$ as follows: starting from a
configuration $\alpha$ satisfying $\mathcal{L}_{\mathcal{P}}$, $\mathcal{P}$ always
behaves according to $\mathcal{SP}$, and any configuration reachable from $\alpha$
satisfies $\mathcal{L}_{\mathcal{P}}$ (closure property). Moreover, if any execution
of $\mathcal{P}$ starting from an arbitrary configuration eventually reaches a
configuration satisfying $\mathcal{L}_{\mathcal{P}}$ (convergence property), we say
that $\mathcal{P}$ stabilizes for $\mathcal{L}_{\mathcal{P}}$ (hence for
$\mathcal{SP}$).

2.3 $\ell$-Exclusion

Specification of the $\ell$-Exclusion Protocol.

Safety. In any computation $e$, at most $\ell$ processors can execute the critical
section concurrently.

Liveness.

1. Fairness. In any computation $e$, each requesting processor can enter the
critical section in a finite time.

2. $\ell$-Liveness. In any computation $e$, if $x < \ell$ processors execute the
critical section forever and some other processors are requesting the critical
section, then at least one other processor will eventually enter the critical
section.

Self-Stabilizing $\ell$-Exclusion Protocol. An $\ell$-exclusion algorithm is
self-stabilizing if every computation starting from an arbitrary initial
configuration eventually satisfies the above safety and liveness requirements.
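
The safety requirement is a simple predicate over a computation and can be checked mechanically on a finite prefix. In the sketch below, a configuration is reduced to a list of booleans (one per processor, true when that processor is in the critical section); this representation and the function name `satisfies_safety` are assumptions made for illustration. The liveness requirements concern infinite computations and are not covered by this check.

```python
def satisfies_safety(computation, ell):
    """True if at most ell processors are in the critical section in every
    configuration of the given (finite prefix of a) computation."""
    return all(sum(1 for in_cs in config if in_cs) <= ell
               for config in computation)
```

For a self-stabilizing protocol, this predicate is expected to hold only on a suffix of each computation: an arbitrary initial configuration may place more than $\ell$ processors in the critical section.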

3 Parametric Composition 

The parametric composition of protocols $P_1$ and $P_2$ is a generalization of the
collateral composition in [?] because it allows not only $P_2$ to read the variables
written by $P_1$ but also $P_1$ to read the variables written by $P_2$. This is also
a generalization of the conditional composition in [?] because it allows not only
$P_2$ to use the predicates of $P_1$ but also $P_1$ to use the predicates of $P_2$.
Informally, $P_1$ can be seen as a tool used by $P_2$, where $P_2$ calls some
“public” functions of $P_1$ (we use the term function with a generic meaning: it
can be the variables used in the collateral composition or the predicates as in the
conditional composition), and $P_1$ can also use some functions of $P_2$ through
the medium of parameters.

Definition 2 (Parametric Composition). Let $P_1$ be a protocol with parameters
and a public part. Let $P_2$ be a protocol such that $P_2$ uses $P_1$ as an external
protocol. By the parameters, $P_2$ allows $P_1$ to use some of its functions (a
function may return no result). By the public part, protocol $P_1$ allows protocol
$P_2$ to call some of its functions. The parametric composition of $P_1$ and $P_2$,
denoted $P_1 \triangleright_p P_2$, is a protocol that has all the variables and all
the actions of $P_1$ and $P_2$.

The implementation scheme of $P_1$ and $P_2$ is given in Algorithm 1. Protocol
$P_2$ allows $P_1$ to use functions $F_1, \ldots, F_\alpha$ (called
$F'_1, \ldots, F'_\alpha$ in $P_1$) and $P_1$ allows $P_2$ to use its public
functions $Pub_1, \ldots, Pub_\beta$ (called $P_1.Pub_1, \ldots, P_1.Pub_\beta$ in
$P_2$).



Algorithm 1  P1 ▷p P2

Protocol P1(F'1 : TF1, F'2 : TF2, ..., F'α : TFα);
Public
    Pub1 : TP1
        /* definition of Function Pub1 */
    ...
    Pubβ : TPβ
        /* definition of Function Pubβ */
begin
    ...
    [] < Guard > → < statement >
        /* Functions F'i can be used in Guard and/or statement */
    ...
end

Protocol P2
External Protocol P1(F1, F2, ..., Fα);
Parameters
    F1 : TF1
        /* definition of Function F1 */
    ...
    Fα : TFα
        /* definition of Function Fα */
begin
    ...
    [] < Guard > → < statement >
        /* Functions P1.Pubi can be used in Guard and/or statement */
    ...
end
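
The two directions of interaction in the scheme above can be mimicked with ordinary objects. The sketch below is not one of the protocols of this paper: the class names `P1Protocol` and `P2Protocol`, the token counter, and the repair rule are all hypothetical and serve only to show a parameter function being passed from P2 to P1 and a public function of P1 being used in a guard of P2.

```python
class P1Protocol:
    """Plays the role of P1: has one parameter and one public function."""

    def __init__(self, count_tokens):
        self._count_tokens = count_tokens   # parameter F'1, supplied by P2
        self.checks = 0                     # a variable of P1

    def needs_repair(self, expected):
        """Public function Pub1: compares P2's token count with a target."""
        self.checks += 1
        return self._count_tokens() != expected


class P2Protocol:
    """Plays the role of P2: uses P1 as an external protocol."""

    def __init__(self, tokens, ell):
        self.tokens = tokens
        self.ell = ell
        self.p1 = P1Protocol(self.count)    # External Protocol P1(F1)

    def count(self):
        """Function F1, made available to P1 through the parameters."""
        return self.tokens

    def act(self):
        """One guarded action of P2 whose guard calls P1.Pub1."""
        if self.p1.needs_repair(self.ell):
            self.tokens = self.ell
            return True
        return False
```

Calling `act()` on `P2Protocol(tokens=5, ell=3)` executes once (the guard holds and the count is reset to 3) and is disabled afterwards, while the composed object holds the variables of both protocols, as in Definition 2.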



Let $\mathcal{L}_1$ and $\mathcal{L}_2$ be predicates over the variables of $P_1$
and $P_2$, respectively. We now define a fair composition w.r.t. both protocols and
define what it means for a parametric composite algorithm to be self-stabilizing.

Definition 3 (Fair Execution). An execution $e$ of the composition of $P_1$ and
$P_2$ is fair w.r.t. $P_i$ ($i \in \{1, 2\}$) if one of these conditions holds:

1. $e$ is finite.

2. $e$ contains infinitely many steps of $P_i$, or contains an infinite suffix in
which no step of $P_i$ is enabled.

Definition 4 (Fair Composition). The composition of $P_1$ and $P_2$ is fair w.r.t.
$P_i$ ($i \in \{1, 2\}$) if any execution of the composition of $P_1$ and $P_2$ is
fair w.r.t. $P_i$.

The following composition theorem and its corollary are obvious: 

Theorem 1. If the following four conditions hold:

1. the composition is fair w.r.t. $P_1$,

2. the composition is fair w.r.t. $P_2$ if $P_1$ is stabilized for $\mathcal{L}_1$,

3. protocol $P_1$ stabilizes for $\mathcal{L}_1$ even if $P_2$ is not stabilized
for $\mathcal{L}_2$, and

4. protocol $P_2$ stabilizes for $\mathcal{L}_2$ if $\mathcal{L}_1$ is satisfied,

then $P_1 \triangleright_p P_2$ stabilizes for $\mathcal{L}_1 \wedge \mathcal{L}_2$.

Corollary 1. Let $P_1 \triangleright_p P_2$ be a self-stabilizing protocol. If
protocol $P_1$ stabilizes in $t_1$ time for $\mathcal{L}_1$ even if $P_2$ is not
stabilized for $\mathcal{L}_2$, and protocol $P_2$ stabilizes in $t_2$ time for
$\mathcal{L}_2$ when $P_1$ is stabilized for $\mathcal{L}_1$, then
$P_1 \triangleright_p P_2$ stabilizes for $\mathcal{L}_1 \wedge \mathcal{L}_2$ in
$t_1 + t_2$ time.



4 Self-Stabilizing $\ell$-Exclusion Protocol on Rings

In this section, we present a self-stabilizing $\ell$-exclusion protocol on unidirectional
rings. Each processor can distinguish its two neighbors: the left neighbor, from
which it can receive messages, and the right neighbor, to which it can send
messages. For a processor $i$, the left neighbor (predecessor) is processor $i-1$ and
the right neighbor (successor) is processor $i+1$, where indices are taken modulo $n$.
The protocol we present is based on the concept of tokens: each time a requesting
processor owns a token, it can enter the critical section. A simple solution is
a circulation of $\ell$ tokens such that no processor can keep more than one token
while it is in critical section. We use a controller mechanism to ensure that the
number of tokens will eventually be equal to $\ell$. The protocol we propose is a
parametric composition of two protocols: $\ell$-Token-Circulation (see Algorithm 3)
and Ring-Controller (see Algorithm 2). In the rest of the paper we call E-tokens
(with E for Exclusion) the tokens used by $\ell$-Token-Circulation and C-token (with
C for Control) the token used by Ring-Controller.

4.1 Ring-Controller

Ring-Controller makes it possible to count the E-tokens of the system without any
counter variable at any processor except Root. Ring-Controller circulates a single
token (the C-token), which continuously traverses the ring. At the beginning of
each traversal, Ring-Controller informs Root that it can both end the current
E-token count and start a new one (see Function START in
$\ell$-Token-Circulation). Roughly speaking, the C-token cannot pass any E-token
and no E-token can pass the C-token, so during a C-token traversal every
E-token visits Root exactly once. Root therefore knows the number of E-tokens
after a complete traversal of the C-token. Since the C-token cannot pass an
E-token (this would cause a counting error, although we will see an exception in
Section 4.2), it must be stopped by a processor in critical section. As in [?], the
C-token circulation is implemented by sending a SeqVal message whose value differs
from that of the previous SeqVal message. We use two local variables (MySeq and
NextSeq) to store the values of SeqVal messages. If a processor $i$ ($i \neq Root$)
does not hold any E-token when it receives the C-token ($SeqVal \neq MySeq$),
then it executes $a_{c3}$: it copies SeqVal into NextSeq and into MySeq, then it sends
the SeqVal message to $i+1$. If $i$ is in critical section ($i$ holds an E-token), it stops
the C-token: Function STOP returns true and $i$ just copies SeqVal into NextSeq
(Action $a_{c4}$). If $i$ is still in critical section when it receives a copy of the SeqVal
message, it does nothing ($SeqVal = NextSeq$). Processor $i$ sends the C-token on
after it exits the critical section and receives a new copy of the SeqVal
message: in this case Function STOP returns false and $i$ can execute Action
$a_{c3}$.
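The reaction of a non-Root processor to SeqVal messages, as described above, can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `RingControllerNode`, the `outbox` list standing in for the link to $i+1$, and the `stop` callback playing the role of Function STOP are all hypothetical.

```python
from typing import Callable, List


class RingControllerNode:
    """Non-Root processor of Ring-Controller (hypothetical helper class)."""

    def __init__(self, stop: Callable[[], bool], my_seq: int = 0, next_seq: int = 0):
        self.stop = stop              # plays the role of Function STOP
        self.my_seq = my_seq          # MySeq
        self.next_seq = next_seq      # NextSeq
        self.outbox: List[int] = []   # assumed stand-in for the link to i+1

    def csend(self) -> None:
        """Public Function Csend: release the stored C-token."""
        self.my_seq = self.next_seq
        self.outbox.append(self.my_seq)

    def receive(self, seq_val: int) -> None:
        """Handle a SeqVal message coming from processor i-1."""
        if seq_val != self.my_seq and not self.stop():
            # Action a_c3: accept the C-token and pass it on.
            self.next_seq = seq_val
            self.csend()
        elif seq_val != self.next_seq and self.stop():
            # Action a_c4: in critical section, so hold the C-token.
            self.next_seq = seq_val
        # Otherwise the message is a copy that was already seen: do nothing.
```

A processor in critical section records the new value without forwarding it; once STOP returns false, the next copy of the message triggers the forwarding step.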



Algorithm 2 Ring-Controller.

For Root

RING-CONTROLLER(START)
    /* START is used to inform Root that the C-token has just passed */
    Variables MySeq : 0..Max
    begin
    (ac1) [] (receive SeqVal from n−1) ∧ (MySeq = SeqVal) →
              MySeq := MySeq + 1
              send MySeq to 1
              START
    (ac2) [] timeout →
              send MySeq to 1
    end

For another processor

RING-CONTROLLER(STOP: Boolean)
    /* STOP is used to stop Controller from sending messages, so it must be
       fair w.r.t. Controller */
    Public Function Csend()
        MySeq := NextSeq
        send MySeq to i+1
    end Function
    Variables MySeq, NextSeq : 0..Max
    begin
    (ac3) [] (receive SeqVal from i−1) ∧ ((MySeq ≠ SeqVal) ∧ ¬STOP) →
              NextSeq := SeqVal
              Csend
    (ac4) [] (receive SeqVal from i−1) ∧ (NextSeq ≠ SeqVal) ∧ STOP →
              NextSeq := SeqVal
    end






Rachid Hadid and Vincent Villain 



4.2 Self-Stabilizing Token-Circulation

To explain the behavior of $\ell$-Token-Circulation (see Algorithm 3) we assume
that Ring-Controller is stabilized (i.e., the C-token is unique).

The protocol uses two functions from the application which needs the
$\ell$-exclusion:

1. Function STATE, with values in {Request, In, Out}, meaning that the application
is requesting the critical section, is in the critical section, or is out of the
critical section while not requesting it, respectively.

2. Function ECS, which does not return any value. This function allows the
application to enter the critical section.



Algorithm 3 ℓ-Token-Circulation.

For Root

ℓ-TOKEN-CIRCULATION(STATE: {Request, In, Out}, ECS)
    External
        RING-CONTROLLER(START)
    Parameters
        Function START()
            for k = 1 to ℓ − Cpt do
                send Token to 1
            Cpt := ℓ − Cpt + T
        end Function
    Variables T : 0..1
              Cpt : 0..ℓ
    begin
    (aℓ1) [] (STATE ∈ {Request, Out}) ∧ (T = 1) →
              if STATE = Out then
                  T := 0
                  send Token to 1
              else ECS
    (aℓ2) [] (receive Token from n−1) ∧ (T = 0) ∧ (Cpt < ℓ) →
              Cpt := Cpt + 1
              if STATE = Request then
                  T := 1
                  ECS
              else send Token to 1
    (aℓ3) [] (receive Token from n−1) ∧ (T = 1) ∧ (Cpt < ℓ) →
              Cpt := Cpt + 1
              send Token to 1
    end

For another processor

ℓ-TOKEN-CIRCULATION(STATE: {Request, In, Out}, ECS)
    External
        RING-CONTROLLER(INCS: Boolean)
    Parameters
        Function INCS(): Boolean
            Return(STATE = In)
        end Function
    Variables T : 0..1
    begin
    (aℓ4) [] (STATE ∈ {Request, Out}) ∧ (T = 1) →
              if STATE = Out then
                  T := 0
                  send Token to i+1
                  RING-CONTROLLER.Csend
              else ECS
    (aℓ5) [] (receive Token from i−1) ∧ (T = 0) →
              if STATE = Request then
                  T := 1
                  ECS
              else send Token to i+1
    (aℓ6) [] (receive Token from i−1) ∧ (T = 1) →
              send Token to i+1
              RING-CONTROLLER.Csend
    end



$T$ is a binary variable indicating whether the processor owns an E-token ($T = 1$)
or does not ($T = 0$). When a processor receives an E-token, it either sends it
immediately to its neighbor if it does not need it ($STATE \neq Request$ or $T = 1$) or
keeps it if it is requesting the critical section ($STATE = Request$).

When a processor leaves the critical section ($STATE = Out$ and $T = 1$) it
sends the E-token it used to its neighbor. The last case ($STATE = Request$ and
$T = 1$) only appears in an initial configuration, because during the execution of
the protocol an E-token is kept only if the processor is requesting the critical
section (see Actions $a_{\ell 2}$ and $a_{\ell 5}$).

To ensure a correct E-token count we must forbid an E-token to pass the
C-token. Unfortunately, this restriction breaks $\ell$-liveness: if the C-token is
stopped forever by a processor in critical section, then the other E-tokens are
stopped behind the C-token. To avoid this drawback, our protocol allows another
E-token to pass both the E-token used by the processor and the stopped C-token.
The C-token then leaves the processor which stopped it before a second E-token
can pass it (Action $a_{\ell 6}$). We explain this double-passing mechanism below:

Consider the following situation, where a processor $i$ holds an E-token ($t_1$) and
is executing its critical section. Moreover, $i$ has stopped the C-token ($C$).
Another E-token ($t_2$) is coming from $i-1$.

[Figure: $t_2$ arrives at processor $i$, which holds $t_1$ and $C$.]

When $i$ receives $t_2$ it executes $a_{\ell 6}$, so it sends $t_2$ followed by $C$:

[Figure: $i$ holds $t_1$; $t_2$, followed by $C$, travels toward $i+1$.]

It is obvious that the above configuration is equivalent to:

[Figure: $i$ holds $t_2$; $t_1$, followed by $C$, travels toward $i+1$.]

From the above observation we can deduce that the double-passing mechanism
has no effect on Root counting. More precisely, assuming the above
equivalence, we can claim the following property:



Property 1. Assuming Ring-Controller is stabilized for the predicate “the C-token
is unique”, Root is visited exactly once by each E-token during a complete
turn of the C-token.

Moreover, this mechanism allows us to claim a second property:

Property 2. The circulation of the C-token and the E-tokens cannot be stopped
forever.



4.3 Correctness Proof 

From [?] and Property 2 we can deduce the following lemma:

Lemma 1. The composition Ring-Controller $\triangleright_p$ $\ell$-Token-Circulation is fair
w.r.t. Ring-Controller.

Lemma 2. The composition Ring-Controller $\triangleright_p$ $\ell$-Token-Circulation is fair
w.r.t. $\ell$-Token-Circulation if Ring-Controller is stabilized.

Proof. Suppose, by way of contradiction, that there exists an execution of
the composition which eventually contains no steps of $\ell$-Token-Circulation.
By Property 2 this implies that no E-token is present in the ring. Since Root
initiates the C-token traversal infinitely often, after one round trip delay Root
detects that there is no E-token in the system. Then it adds $\ell$ E-tokens when
it initiates the next C-token traversal (Function START), a contradiction. □

From [?] we can claim that Ring-Controller stabilizes for the predicate
“the C-token is unique” if the two following conditions hold:

1. Constant $Max > n \cdot pMax + 1$, where $n$ is the ring size and $pMax$ is the
maximum capacity of communication links.

2. A C-token cannot be stopped forever.

So, by Lemma 1 we can deduce the following result:

Lemma 3. Ring-Controller stabilizes for the predicate “the C-token is unique”
even if $\ell$-Token-Circulation is not stabilized.

Lemma 4. Assuming Ring-Controller is stabilized for the predicate “the C-token
is unique”, $\ell$-Token-Circulation stabilizes for the predicate “there exist
exactly $\ell$ E-tokens in the ring”.

Proof. Once Ring-Controller is stabilized, by Property 1 Root can count the
E-tokens without any error. More precisely, if the system contains fewer than $\ell$
E-tokens, Root generates the missing E-tokens when the C-token starts a
new traversal (Function START). If the system contains more than $\ell$ E-tokens,
Root erases all the extra E-tokens during the C-token traversal (see Predicate
$(Cpt < \ell)$ of Actions $a_{\ell 2}$ and $a_{\ell 3}$). □
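The counting argument in the proof of Lemma 4 can be illustrated with a small Python sketch. It models only Root's counter during one C-token traversal; the function names are hypothetical, and starting the counter at 0 is a simplifying assumption (it ignores a token held by Root and tokens created at the start of the traversal).

```python
from typing import Tuple


def root_receive_token(cpt: int, ell: int) -> Tuple[int, bool]:
    """Root receives an E-token; return (new Cpt, whether the token survives)."""
    if cpt < ell:
        return cpt + 1, True    # guard (Cpt < l) holds: the token is counted
    return cpt, False           # guard fails: the extra token is erased


def missing_tokens(cpt: int, ell: int) -> int:
    """Number of E-tokens Function START creates for the next traversal."""
    return max(0, ell - cpt)


def count_traversal(arrivals: int, ell: int, cpt: int = 0) -> Tuple[int, int]:
    """Feed `arrivals` E-tokens to Root during one C-token traversal.

    Returns (tokens that survive, tokens START would add afterwards).
    """
    survivors = 0
    for _ in range(arrivals):
        cpt, kept = root_receive_token(cpt, ell)
        survivors += int(kept)
    return survivors, missing_tokens(cpt, ell)
```

With $\ell = 3$, five arriving E-tokens leave three survivors and nothing to add, while a single arriving E-token leads START to create two more.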

Then it is easy to verify that if both above predicates are satisfied then the
safety (the number of E-tokens is equal to $\ell$ forever), fairness, and $\ell$-liveness
(the circulation of the E-tokens which are not required by some processors
cannot be stopped forever) requirements are satisfied forever. So from Theorem 1
and Lemmas 1, 2, 3, and 4 we can claim the following theorem:

Theorem 2. Ring-Controller $\triangleright_p$ $\ell$-Token-Circulation stabilizes for the
$\ell$-exclusion specification.

Stabilization Time Complexity. The stabilization time can be easily evaluated
from [?] and Corollary 1. Ring-Controller stabilizes in two round trip delays, i.e.,
in $2n$ time. Once Ring-Controller is stabilized, the next round trip ensures the
right count at Root. So the stabilization time is at most $3n$ time.

4.4 Extension

The $\ell$-exclusion protocol works on oriented bi-directional rings, and in this kind
of system we can use bi-directionality to improve the protocol and obtain a
version which tolerates the loss of messages. Moreover, we can reduce the value
of the constant Max to $3(LMax + 1)$ by using the self-stabilizing alternating bit
protocol in [?] and the self-stabilizing three-state ring protocol in [?]. We can
also design a version for rings with unbounded communication links by using
a randomized version of the self-stabilizing alternating bit protocol [?]. Last,
using a self-stabilizing tree construction and the Euler tour of the tree to build
a virtual ring, any of these versions can work on general systems. Unfortunately,
with such a solution, the stabilization time of the $\ell$-exclusion protocol depends
on $n$ instead of $h$ (the tree height), as we could expect.

5 Self-Stabilizing $\ell$-Exclusion Protocol on Trees

In this section, we present a self-stabilizing $\ell$-exclusion protocol on rooted tree
networks. The links are bi-directional. We assume that each processor $i$ ($i \neq Root$)
knows its parent, denoted by MyP. We denote the set of children of any
processor by Children. Each processor (except a leaf) locally distinguishes its
children by some ordering denoted by $1..\Delta_i - 1$ if $i$ is an internal processor and
$1..\Delta_i$ if $i = Root$, where $\Delta_i$ is the degree of $i$.

The basic idea of this algorithm is as follows: Root infinitely often sends a
wave of E-tokens ($0 \ldots \ell$ E-tokens) down the tree. Each E-token follows a branch
consistently from Root to a leaf. Once it arrives at the leaf, an E-token disappears.
Before sending a new wave of E-tokens, Root uses a controller in order to detect
the number of E-tokens of the previous wave which have not disappeared
yet. The self-stabilizing tree $\ell$-exclusion protocol is a parametric composition of
Tree-Controller (Algorithm 4) and $\ell$-Token-Distribution (Algorithm 5).



5.1 Tree-Controller

The basis of Tree-Controller is the Propagation of Information with Feedback
(PIF) scheme in [?]. In the PIF scheme, Root broadcasts a sequence of values
(SeqVal, see Algorithm 4) to every processor in the network. In order to detect
when the current broadcast has terminated, Root needs to obtain feedback (of the
same value it sent) from the other processors. When an internal processor receives
a new value from its parent, it broadcasts this value to its children. When a
leaf of the tree gets a new value, it simply sends an acknowledgment (the same
value) up to its parent. An internal processor sends an acknowledgment up to
its parent when it has received acknowledgments from all its children. When
Root has received an acknowledgment from all its children, Root starts to
broadcast a new value.
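One PIF cycle as described above can be sketched in Python. This is a sequential, depth-first illustration of what is in reality a concurrent wave; the function name `pif_cycle` and the `children` dictionary are assumptions made for the example and do not appear in the protocol.

```python
from typing import Dict, List


def pif_cycle(children: Dict[str, List[str]], root: str, value: int) -> List[str]:
    """Run one PIF cycle and return the order in which processors acknowledge."""
    seq: Dict[str, int] = {}      # value stored by each processor (MySeq)
    acked: List[str] = []

    def broadcast(node: str) -> None:
        seq[node] = value                      # the node receives the new value
        for child in children.get(node, []):   # and forwards it to every child
            broadcast(child)
        # A leaf reaches this point immediately; an internal processor only
        # after all of its children have acknowledged.
        acked.append(node)

    broadcast(root)
    return acked
```

Root appears last in the returned list, which mirrors the requirement that the cycle terminates at Root only after every processor has acknowledged.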

More formally, we can define the specification of the PIF Scheme from that
of the PIF Cycle as follows:

Algorithm 4 Tree-Controller.

Function Forward(Mes)
    for all j ∈ Children do send Mes to j
end Function

Function Feedback(Mes)
    send Mes to MyP
end Function

For Root

TREE-CONTROLLER(START, SIGNAL)
    Variables MySeq : 0..Max
              Answer[j] : boolean flag for each j ∈ Children
    begin
    (ac1) [] receive SeqVal from j ∈ Children →
              if (MySeq = SeqVal) then
                  Answer[j] := true
                  if (∀j ∈ Children :: Answer[j] = true) then
                      MySeq := MySeq + 1
                      for all j ∈ Children do
                          Answer[j] := false
                      START
                      Forward(MySeq)
    (ac2) [] receive Mark from j ∈ Children →
              SIGNAL
    (ac3) [] timeout →
              Forward(MySeq)
    end

For an internal processor

TREE-CONTROLLER(STOP: Boolean)
    Variables MySeq : 0..Max
              Answer[j] : boolean flag for each j ∈ Children
    begin
    (ac4) [] receive SeqVal from MyP →
              if (MySeq ≠ SeqVal) then
                  MySeq := SeqVal
                  for all j ∈ Children do
                      Answer[j] := false
                  if STOP then
                      Feedback(Mark)
              Forward(MySeq)
    (ac5) [] receive SeqVal from j ∈ Children →
              if (MySeq = SeqVal) then
                  Answer[j] := true
                  if (∀j ∈ Children :: Answer[j] = true) then
                      Feedback(MySeq)
    (ac6) [] receive Mark from j ∈ Children →
              Feedback(Mark)
    end

For a leaf processor

TREE-CONTROLLER(STOP: Boolean)
    Variables MySeq : 0..Max
    begin
    (ac7) [] receive SeqVal from MyP →
              if (MySeq ≠ SeqVal) then
                  MySeq := SeqVal
                  if STOP then Feedback(Mark)
              Feedback(MySeq)
    end
Specification 1 (PIF Cycle).

[PIF1] Root initiates a cycle by broadcasting a message m.

[PIF2] The cycle terminates in Root.

[PIF3] Root terminates the cycle if and only if all processors acknowledged the
receipt of m.

Specification 2 (PIF Scheme). The PIF scheme is an infinite sequence of
PIF Cycles.

In Tree-Controller we add a special message: Mark. This message is sent
to Root by a processor in critical section (see Function STOP) when it receives
a SeqVal message. The Mark message informs Root that an E-token is in use.
Tree-Controller informs Root that the PIF has just started by using Function
START.

5.2 Self-Stabilizing Token-Distribution

To explain the behavior of $\ell$-Token-Distribution (see Algorithm 5) we assume
that Tree-Controller is stabilized.

Algorithm 5 ℓ-Token-Distribution.

ℓ-TOKEN-DISTRIBUTION(STATE: {Request, In, Out}, ECS)

For Root

    Macro Next = Ch + 1 if Ch < Δ
                 1      otherwise
    External
        TREE-CONTROLLER(START, SIGNAL)
    Parameters
        Function START()
            if (STATE = Request) ∧ ((ℓ − Cpt) > 0) then
                T := 1
            for k = 1 to ℓ − (Cpt + T) do
                send Token to Ch
                Ch := Next
            Cpt := 0
        end Function
        Function SIGNAL()
            if Cpt < ℓ then
                Cpt := Cpt + 1
        end Function
    Variables T : 0..1
              Cpt : 0..ℓ
    begin
    (aℓ1) [] (STATE ∈ {Request, Out}) ∧ (T = 1) →
              if STATE = Out then
                  T := 0
                  send Token to Ch
                  Ch := Next
              else ECS
    end

For an internal processor

    Macro Next = Ch + 1 if Ch < Δ − 1
                 1      otherwise
    External
        TREE-CONTROLLER(INCS)
    Parameters
        Function INCS(): Boolean
            Return(STATE = In)
        end Function
    Variables T : 0..1
    begin
    (aℓ2) [] (STATE ∈ {Request, Out}) ∧ (T = 1) →
              if STATE = Out then
                  T := 0
                  send Token to Ch
                  Ch := Next
              else ECS
    (aℓ3) [] receive Token from MyP →
              if STATE = Request then
                  T := 1
                  ECS
              else send Token to Ch
                   Ch := Next
    (aℓ4) [] (receive Token from MyP) ∧ (T = 1) →
              send Token to Ch
              Ch := Next
    end

For a leaf processor

    External TREE-CONTROLLER(INCS: Boolean)
    Parameters
        Function INCS(): Boolean
            Return(STATE = In)
        end Function
    Variables T : 0..1
    begin
    (aℓ5) [] (STATE ∈ {Request, Out}) ∧ (T = 1) →
              if STATE = Out then T := 0 else ECS
    (aℓ6) [] (receive Token from MyP) ∧ (T = 0) →
              if STATE = Request then T := 1; ECS
    end



We use a switch mechanism [?] to ensure fairness. The switch mechanism
is implemented at Root and every internal processor by using a pointer variable
Ch, which holds a value in Children. By using this mechanism, E-tokens are
sent in a breadth-first manner. We use a macro (not a variable, but dynamically
evaluated) called Next to choose the next child to visit. Recall that every
processor maintains a local ordering among its children.

Each time Tree-Controller calls Function START, Root sends a wave of
E-tokens. Note that if Root requests to enter its critical section, it keeps one
E-token and enters its critical section (see Function START and Action $a_{\ell 1}$).

When a processor $i$ receives an E-token from its parent, $i$ does the following:
if $i$ requests to enter its critical section ($STATE = Request$), then $i$ sets $T = 1$ and
enters its critical section (see Action $a_{\ell 3}$ for an internal processor and Action
$a_{\ell 6}$ for a leaf processor). Otherwise, either $i$ is an internal processor and $i$
sends the E-token to Ch ($a_{\ell 3}$), or $i$ is a leaf processor and the E-token
disappears. When a processor $i$ exits its critical section, it sends its E-token to
Ch if $i$ is Root or an internal processor (see Action $a_{\ell 1}$ for Root and Action
$a_{\ell 2}$ for an internal processor); otherwise the E-token disappears (see Action
$a_{\ell 5}$).

The principle of Tree-Controller is not the same as that of Ring-Controller. It
allows Root to count the E-tokens which may still be in use (Cpt). Once
Tree-Controller is stabilized, no processor in critical section can send more than
one Mark message. Moreover, the links are FIFO, so we are sure that the unused
E-tokens have disappeared at the end of the control broadcast phase. So when
Root sends $\ell - Cpt$ E-tokens, the total number of E-tokens is always less than
or equal to $\ell$.
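The bookkeeping Root performs through Functions SIGNAL and START can be sketched in Python. This is a minimal sketch rather than the protocol itself: the class `TreeRoot` is hypothetical, `start` returns the list of children that would be served instead of sending messages, and Ch is kept 1-based with `n_children` standing for $\Delta$.

```python
from typing import List


class TreeRoot:
    """Root-side counters of l-Token-Distribution (hypothetical helper)."""

    def __init__(self, ell: int, n_children: int):
        self.ell = ell
        self.n_children = n_children
        self.cpt = 0    # Cpt: Mark messages counted in the current PIF cycle
        self.t = 0      # T: 1 if Root itself holds an E-token
        self.ch = 1     # Ch: child that receives the next E-token

    def signal(self) -> None:
        """Function SIGNAL: a Mark message reached Root."""
        if self.cpt < self.ell:
            self.cpt += 1

    def start(self, requesting: bool) -> List[int]:
        """Function START: send a new wave; return the children served."""
        if requesting and self.ell - self.cpt > 0:
            self.t = 1                        # Root keeps one E-token
        served: List[int] = []
        for _ in range(self.ell - (self.cpt + self.t)):
            served.append(self.ch)            # send Token to Ch
            # macro Next: move to the following child, wrapping around
            self.ch = self.ch + 1 if self.ch < self.n_children else 1
        self.cpt = 0
        return served
```

For example, with $\ell = 3$ and one Mark message counted, START serves two children; when $\ell$ Mark messages have been counted, the wave is empty.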

5.3 Correctness Proof

The progress of the PIF cycle and that of the E-tokens are almost independent,
since an E-token does not stop the broadcast message. They only synchronize at
the beginning of the PIF cycle (Action $a_{c1}$ and Function START). So we can claim:

Lemma 5. The composition Tree-Controller $\triangleright_p$ $\ell$-Token-Distribution is fair
w.r.t. Tree-Controller.

Lemma 6. The composition Tree-Controller $\triangleright_p$ $\ell$-Token-Distribution is fair
w.r.t. $\ell$-Token-Distribution if Tree-Controller is stabilized.

Proof. Suppose, by way of contradiction, that there exists an execution of
the composition which contains no steps of the $\ell$-exclusion protocol. This implies
that no E-token is present in the tree. Since Root initiates the controller
infinitely often, after $2h$ time Root detects that there is no E-token in the system.
Then it adds $\ell$ E-tokens when it initiates the next controller cycle, a contradiction. □

From [?] we can claim that Tree-Controller stabilizes for Specification 2
if $Max > n \cdot Lmax$, where $Lmax$ is the maximum capacity of communication
links and $n$ is the number of tree nodes. So we can claim the following result:

Lemma 7. Tree-Controller stabilizes for Specification 2 even if
$\ell$-Token-Distribution is not stabilized.

Lemma 8. (Safety) Assuming Tree-Controller is stabilized for Specification 2,
the system will eventually contain at most $\ell$ E-tokens forever.

Proof. Let $\gamma_{start}$ be a configuration at the end of a PIF cycle of Tree-Controller
and $Cpt_{start}$ the value of Cpt at $\gamma_{start}$. We denote by $\#Mark_{start}$ the
number of Mark messages received by Root during the PIF cycle and by
$\#E\text{-}tokens_{start}$ the number of E-tokens in the tree at $\gamma_{start}$. By Action $a_{c2}$ and
Function SIGNAL, we have $Cpt_{start} = \min(\#Mark_{start}, \ell)$. Since during the PIF
cycle no extra E-token is created, but an E-token which generated a Mark
message may disappear at a leaf, we can claim that $\#Mark_{start} \geq \#E\text{-}tokens_{start}$.
Then, when Root initiates the next PIF cycle, we consider two cases.

1. $\#Mark_{start} \geq \ell$. In this case $Cpt_{start} = \ell$ and Root does not send any new
E-token but eventually starts a new PIF cycle (see Function START). So if
$\#E\text{-}tokens_{start} > \ell$ then $\#Mark_{start} > \ell$ and Root does not send any new
E-token. Since E-tokens eventually disappear at a leaf, the number of E-tokens
decreases until it becomes less than $\ell$.

2. $\#Mark_{start} < \ell$. In this case $Cpt_{start} = \#Mark_{start}$ and Root sends
$\ell - Cpt_{start} \leq \ell - \#E\text{-}tokens_{start}$ E-tokens. So the number of E-tokens
($\ell - Cpt_{start} + \#E\text{-}tokens_{start}$) is always less than or equal to $\ell$.

□

We need the following definition to prove the next lemma. A path $\mu_{ij}$ is a
sequence $(i_0, i_1, \ldots, i_{x-1}, i_x)$ such that the following conditions are true: (1) $i = i_0$,
(2) $j = i_x$, (3) $x \geq 1$, (4) $\forall m \in [0, x-1]$, $i_m = P_{i_{m+1}}$, where $P_i$ denotes the
parent of $i$. The length $x$ of a path $\mu_{ij}$ is denoted by $|\mu_{ij}|$.

Lemma 9. (Fairness) Assuming Tree-Controller is stabilized for Specification 2,
in any execution $e$, each processor $i$ receives (at least) an E-token every $W$
E-tokens sent by Root, where $W = \prod_{j \in \mu_{r P_i}} \#Children_j$ if $i \neq r$ and $W = 1$
if $i = r$.

Proof. We prove this lemma by induction on the length of the path $\mu_{ri}$
from Root to $i$ ($|\mu_{ri}|$).

1. Basic step. The case where $i = r$ is obvious. Consider the case where $|\mu_{ri}| = 1$,
so $i \in Children_r$. It is easy to see that if $W = \#Children_r$, then by the switch
mechanism each such processor $i$ receives an E-token.

2. Induction step. Assume that the lemma is true for $|\mu_{ri}| \leq k - 1$, $k > 1$.
Consider the case where $|\mu_{ri}| = k$. Let $W$ be an integer such that

$$W = \prod_{j \in \mu_{r P_i}} \#Children_j = (\#Children_{P_i}) \prod_{j \in \mu_{r P_{P_i}}} \#Children_j$$

By hypothesis, $P_i$ receives at least $\#Children_{P_i}$ E-tokens every $W$ E-tokens
sent by Root; by the switch mechanism each of its children receives an E-token,
in particular Processor $i$. □
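The bound of Lemma 9 can be checked on a small tree with the following Python sketch. It assumes that no processor keeps an E-token, so every E-token travels down to a leaf; the `children` and `parent` dictionaries, the zero-based pointers, and both function names are illustrative choices, not part of the protocol.

```python
from math import prod
from typing import Dict, List


def w_bound(children: Dict[str, List[str]], parent: Dict[str, str], node: str) -> int:
    """W of Lemma 9: product of #Children over the path from Root to parent(node)."""
    sizes = []
    while node in parent:              # Root has no entry in `parent`
        node = parent[node]
        sizes.append(len(children[node]))
    return prod(sizes)                 # the empty product is 1, the case i = r


def distribute(children: Dict[str, List[str]], root: str, tokens: int) -> Dict[str, int]:
    """Send `tokens` E-tokens from Root with the switch mechanism."""
    pointer = {n: 0 for n in children}                 # Ch of each non-leaf
    nodes = set(children) | {c for cs in children.values() for c in cs}
    received = {n: 0 for n in nodes}
    for _ in range(tokens):
        node = root
        while children.get(node):                      # follow a branch to a leaf
            nxt = children[node][pointer[node]]
            pointer[node] = (pointer[node] + 1) % len(children[node])  # macro Next
            received[nxt] += 1
            node = nxt
    return received
```

On a tree where Root has two children and one of them has three children, $W = 6$ for the grandchildren, and sending six E-tokens indeed delivers one to each of them.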



Lemma 10. ($\ell$-Liveness) Assuming Tree-Controller is stabilized for
Specification 2, any execution $e$ satisfies the $\ell$-liveness property.

Proof. Suppose, by way of contradiction, that there exists an execution $e$ such
that $x$ processors, $x < \ell$, execute the critical section forever and no other
requesting processor can ever get an E-token.

Since no other processor can ever enter the critical section, the $\ell - x$
E-tokens are unused forever. So, eventually each PIF wave of Tree-Controller
sends exactly $x$ Mark messages to Root forever. Then between each PIF
wave Root sends exactly $\ell - x$ E-tokens. Since $\ell - x > 0$, by Lemma 9 any
processor will eventually receive an E-token. This contradicts our assumption.

□

From Lemmas 8, 9, and 10 we can claim the following result:

Lemma 11. Assuming Tree-Controller is stabilized for Specification 2,
$\ell$-Token-Distribution stabilizes for the $\ell$-exclusion specification.

So from Theorem 1 and Lemmas 5, 6, 7, and 11 we can claim the following
theorem:

Theorem 3. Tree-Controller $\triangleright_p$ $\ell$-Token-Distribution stabilizes for the
$\ell$-exclusion specification.

Stabilization Time Complexity. The stabilization time of Tree-Controller is $4h + 2$
time [?]. Once Tree-Controller is stabilized, after a new PIF wave Root will
never send too many E-tokens into the system. So the stabilization time of the
composition is bounded by $6h + 2$ time.

5.4 Extension 

The tree $\ell$-exclusion protocol has a drawback: the number of times that other
processors can enter the critical section before a processor $p$ can do so is
$O(CH^{h} \times h)$, where $CH = \max_i(\#Children_i)$. We can remove this drawback by
changing the switch mechanism as follows: every requesting processor sends a
request to Root, and every intermediate processor enqueues the number of the link
the request came from. The E-tokens then follow the paths described by the local
queues. After they have been consumed, the E-tokens are erased by the processors
that used them. In this case the delay is only $O(n)$. The drawback of this
solution is that the size of the queue is $O(n)$. Nevertheless, we can also improve
the algorithm, as for rings, to obtain a version which tolerates the loss of
messages. We can also reduce the value of the constant Max to $3(LMax + 1)$ by
using the self-stabilizing alternating bit protocol in [?] and the self-stabilizing
three-state PIF protocol in [?].

6 Conclusion 

In this paper, we present the first self-stabilizing protocols for the $\ell$-exclusion
problem in the message passing model. We propose a new technique for the design of
self-stabilizing $\ell$-exclusion: the controller. This tool makes it possible to count
the tokens of the system without any counter variable at any processor except one,
called Root. We also introduce a new protocol composition called parametric
composition. This is a generalization of the collateral composition in [?] and of
the conditional composition in [?]. Then we present protocols on rings and on trees.
The space requirement of both algorithms is independent of $\ell$ for all processors
except Root. The stabilization time of the first protocol is $3n$ time, where $n$ is
the ring size, and the stabilization time of the second one is $6h + 2$ time, where $h$
is the tree height. We also discuss numerous adaptations to various models.

Abstract. This paper introduces the problem of $n$ mobile agents that
repeatedly visit all $n$ nodes of a given network, subject to the constraint
that no two agents can simultaneously occupy a node. It is shown for
a tree network and a synchronous model that this problem has $O(\Delta n)$
upper and lower time bounds, where $\Delta$ is the maximum degree of any
vertex in the communication network. The synchronous algorithm is
self-stabilizing and can also be used for an asynchronous system. A second
algorithm is presented and analyzed to show $O(n)$ round complexity for
the case of a line of $n$ asynchronous processes.



1 Introduction 

A fundamental task for (mobile) agent-based computing is search, or visitation, 
of a group of network nodes. Agents are convenient programming entities for 
distributed systems because they encapsulate procedures for a sequence of loca- 
tions without having to specify programs at each location. A collection of agents 
can have different functions or cooperatively work for a single function (some re- 
cent research metaphorically calls collections of agents “swarms of insects” ) . 
Ideally, agents are autonomous, and agent-based computation is self-stabilizing. 
Autonomous agents are capable of independent action in open, unpredictable en- 
vironments. If we interpret a mismatch between the internal variables of agents 
and the environmental variables as a faulty condition, then autonomous agents 
can be fault tolerant by taking local actions to correct their internal variables to 
be consistent with the environment. The paradigm of self-stabilization general- 
izes this problem, tolerating even initial situations where internal variables are 
mutually inconsistent. This generalization can simplify agent design by avoiding 
detailed case analysis based on all the ways that the environment can change. 
These considerations motivate the study of self-stabilizing protocols based on 
the mobile agent paradigm. 

We suppose in this paper that agents require considerable resource support
with respect to the host platform. For instance, the host could be an embedded
controller with limited memory and processing power. To model this kind of
limitation, we propose the following constraint: no node can contain more than one
agent at a time. Therefore, for an agent to move about the system, other agents
in its path may need to be displaced. In this paper, the fundamental operation
that satisfies this constraint is the agent swap operation. Agents at neighboring
nodes are permitted to exchange locations in a single atomic swap operation.
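The swap operation can be illustrated by a short Python sketch. The representation is an assumption made for the example: `location` maps each node to the agent it hosts, `edges` is a set of undirected links, and the function name `swap` is hypothetical. Atomicity is modeled simply by performing both updates in one assignment.

```python
from typing import Dict, Set, Tuple


def swap(location: Dict[str, str], edges: Set[Tuple[str, str]], u: str, v: str) -> None:
    """Exchange the agents hosted at neighboring nodes u and v, in place."""
    if (u, v) not in edges and (v, u) not in edges:
        raise ValueError("swap is only allowed between neighboring nodes")
    # Both nodes are updated together, so each node still hosts exactly one agent.
    location[u], location[v] = location[v], location[u]
```

Because a swap only permutes agents between two nodes, the constraint of one agent per node holds before and after every step.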

Related Work. The topic of mobile agents in a self-stabilizing system appears
in [?] and is also motivated by [?]. Other papers that study mobility issues
in the context of self-stabilization include [?]. Except for some motivation,
general programming treatments of mobile agents in [?] have little relation to
this paper. Remarks in the conclusion explain that previous work on supporting
serial execution models in distributed systems for self-stabilizing algorithms
is related by similar techniques to the algorithms presented
here. The protocols presented in Sections 4.1 and 5 have the simple character of
wave algorithms (though there are no initiators or decide events) and phase
clock protocols [?].

Summary of Results and Contributions. The main results of this paper are the 
definition of the problem of n-agent traversal, lemmas that show lower bound 
time complexity in a tree, and a construction with an upper bound complexity 
matching this lower bound. An additional algorithm solving the problem for a 
tree is very simple (essentially requiring no convergence, since every state is le- 
gitimate) and we analyze its time complexity for the case of a linear array of 
processes. 

2 Network, Agents, and Self-Stabilization 

A complete description of a mobile agent architecture would discern between
variables that accompany agents, the so-called briefcase variables of [?], and
variables that are attached to hosts, which can be read and written by agents.
Since the focus of this paper is a certain behavior of agents (traversing the net- 
work) rather than the application invoked by agents as they travel, we can use 
a simplified model. We specify the traversal protocol using traditional guarded 
assignments, but extend the execution model to support the swap operation. 

The system consists of n > 1 nodes in a network described by a connected, 
undirected graph; let j G Ni denote that node j is a neighbor of node i in the 
graph. Let St be the degree of node i in the graph, that is, St = |W|- We suppose 
Ni has a fixed ordering, and Ni[k] denotes the k-th node in this ordering. 

Programs for the nodes are specified by guarded assignments of the form
$g_i \rightarrow S_i$, where $g_i$ is a boolean function and $S_i$ is an
assignment to the variables of node $i$; both $g_i$ and $S_i$ may refer to
variables of nodes in the set $\{i\} \cup N_i$. The guarded assignments are also
called actions of the system.

A state of the system is a specification of values for the variables of all nodes.
With respect to any given state $\sigma$, a guarded assignment
$g_i \rightarrow S_i$ is termed enabled if $g_i$ holds at $\sigma$ and is
otherwise disabled. We assume that programs satisfy the following property: if
$g_i \rightarrow S_i$ is enabled at a state $\sigma$, and execution of the
assignment $S_i$ at $\sigma$ results in state $\sigma'$, then
$g_i \rightarrow S_i$ is disabled at $\sigma'$. Thus any enabled guarded
assignment will disable itself if executed.
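
The guarded-assignment model can be illustrated with a small Python sketch.
This is only an illustration under stated assumptions: the class name
`GuardedAssignment`, its methods, the one-integer-per-node state, and the toy
copy rule are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict

State = Dict[int, int]  # assumption: one integer variable per node


@dataclass
class GuardedAssignment:
    node: int
    guard: Callable[[State], bool]   # plays the role of g_i
    assign: Callable[[State], int]   # plays the role of S_i

    def enabled(self, state: State) -> bool:
        """The action is enabled exactly when its guard holds."""
        return self.guard(state)

    def execute(self, state: State) -> State:
        """Return the successor state; only this node's variable changes."""
        nxt = dict(state)
        nxt[self.node] = self.assign(state)
        return nxt


# Toy rule (hypothetical): node 1 copies node 0's value whenever they differ.
# Executing it makes the two values equal, so the action disables itself.
copy_rule = GuardedAssignment(
    node=1,
    guard=lambda s: s[1] != s[0],
    assign=lambda s: s[0],
)

sigma = {0: 7, 1: 3}
sigma_next = copy_rule.execute(sigma)
```

The toy rule satisfies the self-disabling property assumed in the text: it is
enabled at `sigma` and disabled at `sigma_next`.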

One variable of each node is reserved for an entity called an agent. The do- 
main of such a variable is such that it always “contains” one agent. We suppose 
that each agent has a unique identity, and a basic axiom for any state (including 
erroneous initial states) is that no agent can be located at more than one node 
at any state. 

Let us review the standard execution semantics for guarded assignments
typically used for self-stabilizing algorithms. A synchronous computation of the
system is a maximal sequence of states so that each consecutive pair of states
$(\sigma, \sigma')$ includes the (parallel) execution of one guarded assignment
at each node that has an enabled guarded assignment. An asynchronous computation
of the system is a maximal sequence of states so that each consecutive pair of
states $(\sigma, \sigma')$ corresponds to the execution of one enabled guarded
assignment at state $\sigma$. Within
a computation (synchronous or asynchronous) any pair of consecutive states is 
called a step of the computation. A weakly fair asynchronous computation is 
an asynchronous computation containing no suffix in which some node has an 
enabled guarded assignment at each state in the suffix (because we suppose each 
node has only one guarded assignment, the condition of being continuously en- 
abled implies that the node does not execute the assignment throughout the 
sequence). In an asynchronous computation, time is measured by rounds. The 
first round is the minimal prefix of the computation so that, for each node i, 
either i’s guarded assignment is disabled at some state of the prefix, or some 
consecutive pair of states corresponds to the execution of a guarded assignment 
by node $i$. Round $k$, for $k > 1$, is defined recursively, by applying the
definition of the first round to the remaining suffix of states in the
computation.
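
The round definition can be made concrete with a short Python sketch. The
representation is an assumption made for illustration: a finite computation
prefix is given as a list `enabled`, where `enabled[t]` is the set of nodes with
an enabled action at state `t`, and a list `executed`, where `executed[t]` is
the set of nodes that execute in the step from state `t` to state `t + 1`. The
function name `split_rounds` is hypothetical.

```python
from typing import List, Set


def split_rounds(nodes: Set[int],
                 enabled: List[Set[int]],
                 executed: List[Set[int]]) -> List[int]:
    """Return the state indices at which successive rounds end."""
    boundaries = []
    start = 0
    last = len(enabled) - 1
    while start < last:
        pending = set(nodes)   # nodes not yet accounted for in this round
        end = None
        for s in range(start, last + 1):
            # A node is accounted for if it is disabled at some state ...
            pending = {v for v in pending if v in enabled[s]}
            if not pending:
                end = s
                break
            # ... or if it executes in the step leaving that state.
            if s < last:
                pending -= executed[s]
                if not pending:
                    end = s + 1
                    break
        if end is None or end == start:
            break              # the prefix holds no further complete round
        boundaries.append(end)
        start = end            # the next round starts where this one ended
    return boundaries


# Three nodes; each step is executed by a single node.
trace_enabled = [{0, 1, 2}, {1, 2}, {0, 2}, {0, 1}, set()]
trace_executed = [{0}, {1}, {2}, {0}]
rounds = split_rounds({0, 1, 2}, trace_enabled, trace_executed)
```

In this trace the first round needs three steps because every node stays
enabled until it executes; the second round needs only one step because the
remaining nodes are found disabled.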

We now propose an extension of the execution semantics of guarded
assignments to include an agent swap operation. An agent swap operation in an
asynchronous computation is a consecutive pair of states $(\sigma, \sigma')$ that
changes variables of two neighboring nodes, say $i$ and $j$. If at $\sigma$ two
agents $X$ and $Y$ are located at nodes $i$ and $j$ respectively, then an agent
swap reverses the locations of $X$ and $Y$. The swap operation
$(\sigma, \sigma')$ may also change values of (non-agent) variables at nodes $i$
and $j$. This notion of agent swap can be generalized to an agent
move, for the case of an agent atomically moving from one node to another, and 
agent swaps for synchronous computations can be defined in the obvious way 
(we omit details) . While it is unusual to consider the type of parallel assignment 
for an agent swap in guarded assignments of self-stabilizing algorithms, recall 
that CSP’s communication mechanism PS] does something similar, changing the 
local state of two processes atomically, based on the “willingness” of both par- 
ties. Rather than developing a formal programming notation for agent swap, we 
use informal descriptions for the few protocols presented in subsequent sections. 
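
As an illustration of the swap step, the following Python sketch models a state
as a map from nodes to agents and performs one atomic swap across an edge. The
names `agent_swap`, `location`, and `edges` are assumptions introduced for this
sketch; non-agent variables are left out.

```python
from typing import Dict, FrozenSet, Set


def agent_swap(location: Dict[int, str],
               edges: Set[FrozenSet[int]],
               i: int, j: int) -> Dict[int, str]:
    """Return the successor state in which the agents at i and j trade places."""
    if frozenset((i, j)) not in edges:
        raise ValueError("agents can only be swapped between neighboring nodes")
    nxt = dict(location)
    # Both nodes change in the same step, so no agent is ever duplicated or lost.
    nxt[i], nxt[j] = location[j], location[i]
    return nxt


line_edges = {frozenset((0, 1)), frozenset((1, 2))}
sigma = {0: "X", 1: "Y", 2: "Z"}
sigma_prime = agent_swap(sigma, line_edges, 0, 1)
```

Because the swap is a single transition, the axiom that no agent occupies more
than one node is preserved by construction.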

A subset $B$ of all possible computations (synchronous or asynchronous)
specifies the legitimate behaviors of the system. A system is self-stabilizing if
every computation has a suffix contained in the set $B$. The stabilization time
of a system is the worst case, taken over all possible computations, of the
length of the minimum prefix beyond which the remaining suffix is an element of
$B$. For asynchronous computations, the stabilization time is determined by the
number of rounds in the prefix rather than its length.

3 Problem Statement and Lower Bound 

Given are n distinct agents, initially present at arbitrary nodes of the network 
so that no two agents are located at a common node. Initial values of non-agent 
variables at the nodes can be arbitrary. Legitimate behavior for the system is 
any computation so that, for each agent A and node p, the agent A is located 
at node p at infinitely many states. The n-agent traversal problem is to specify 
a protocol so that all its computations are legitimate behaviors. 

Efficiency of a solution to the traversal problem is measured by the time 
required for all agents to visit all nodes. An agent walk is a sequence of nodes 
so that there is a graph edge for every consecutive pair of nodes. A traversal is 
an agent walk that includes each node at least once and has the same node for 
the initial and final item in the sequence. Efficiency is determined by measuring 
the worst-case minimum time for every agent to complete a traversal. A trivial
lower bound of $\Omega(n)$ for all agents to complete a traversal applies to all
networks; however, tighter lower bounds on efficiency are topology-specific. For
the case of a tree network, we have a more precise lower bound.

Lemma 1. Given a tree network with maximum degree $\Delta$, the efficiency of any
n-agent traversal protocol is $\Omega(\Delta n)$.

Proof. By counting agent visits to a particular node, the bound is derived.
Consider first the case of a synchronous computation. Let $v$ be some node having
degree $\Delta$ in the communication graph. Let $P$ be the agent initially
located at $v$. To complete a traversal, $P$ leaves $v$ and returns to $v$ at
least once for each neighbor (this follows because the graph is a tree). Because
each agent swap can contribute to two traversals, we count time only for agents
that move to $v$ by a swap. In this accounting, agent $P$'s contribution is
$\Delta$, once for each neighbor of $v$. Consider some agent $Q \neq P$. Agent
$Q$ enters $v$ and departs from $v$ at least once for each edge incident on $v$,
hence the accounting for $Q$ is the same as for $P$. Thus, each of the $n$ agents
contributes $\Delta$ swap operations just for arrivals at $v$, which provides an
$\Omega(\Delta n)$ bound. For an asynchronous computation, we have only to
consider computations in which the operation by $v$ (if any is enabled) occurs
last in the round. This ensures that $v$ participates in at most one swap per
round, and the $\Omega(\Delta n)$ bound applies. □
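
The counting in the proof can be summarized as a short calculation. The symbol
$T$ (the number of steps, or rounds, needed for all agents to complete a
traversal) is introduced here only for illustration; the inequalities restate
the argument above under the proof's own accounting and add nothing beyond it.

```latex
\begin{align*}
\#\{\text{arrivals at } v\} &\;\ge\; \sum_{\text{agents } X} \Delta \;=\; n\Delta
  && \text{(each agent arrives at $v$ at least $\Delta$ times)} \\
\#\{\text{arrivals at } v \text{ per step}\} &\;\le\; 1
  && \text{($v$ takes part in at most one swap per step)} \\
T &\;\ge\; n\Delta \;=\; \Omega(\Delta n)
\end{align*}
```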

4 Phase-Based Algorithm 

The main result of this section is an algorithm that completes n-agent traversal
for a tree in $O(\Delta n)$ time; the algorithm also applies to rings, but does
not work for all topologies. We present the algorithm first for a synchronous
computation and then discuss adaptation to the asynchronous model.

4.1 Synchronous Algorithm

Two building blocks for the algorithm are edge coloring Q and phase synchro- 
nization mu, which are tasks having published self-stabilizing solutions. Colors 
are represented by natural numbers in the domain $[0, \Delta - 1]$ and every node
$v$ has a non-agent variable $color_v[w]$ for each neighbor $w \in N_v$ (in the
presentation below, $v$ and $w$ are implicitly typed as neighboring nodes).
Phases are natural numbers, and every node $v$ has a non-agent variable $clock_v$
containing a phase number. The domain $R$ of phase numbers is either infinite or
a given domain $[0, m - 1]$ for some $m > \Delta$. A self-stabilizing edge
coloring algorithm reaches a fixed point satisfying the Colored predicate:

\[
\begin{aligned}
Colored \equiv\ & (\forall v, w :: color_v[w] \in [0, \Delta - 1]) \\
\wedge\ & (\forall v, w :: color_v[w] = color_w[v]) \\
\wedge\ & (\forall v :: |\{\, color_v[w] \mid w \in N_v \,\}| = \delta_v)
\end{aligned}
\]
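
The three conjuncts of the Colored predicate translate directly into a checker.
The sketch below is illustrative only: the function name `is_colored` and the
nested-dictionary representation `color[v][w]` are assumptions, not part of the
paper.

```python
from typing import Dict, List


def is_colored(neighbors: Dict[int, List[int]],
               color: Dict[int, Dict[int, int]],
               delta: int) -> bool:
    """Check the Colored predicate for an edge coloring with delta colors."""
    for v, nbrs in neighbors.items():
        for w in nbrs:
            # Conjunct 1: every color lies in the domain [0, delta - 1].
            if not 0 <= color[v][w] <= delta - 1:
                return False
            # Conjunct 2: both endpoints agree on the color of their edge.
            if color[v][w] != color[w][v]:
                return False
        # Conjunct 3: the edges incident on v carry pairwise distinct colors.
        if len({color[v][w] for w in nbrs}) != len(nbrs):
            return False
    return True


path = {0: [1], 1: [0, 2], 2: [1]}
proper = {0: {1: 0}, 1: {0: 0, 2: 1}, 2: {1: 1}}
clash = {0: {1: 0}, 1: {0: 0, 2: 0}, 2: {1: 0}}      # node 1 repeats color 0
mismatch = {0: {1: 1}, 1: {0: 0, 2: 1}, 2: {1: 1}}   # endpoints disagree
```

Every conjunct refers only to a node and its neighbors, so each node can test
its own part of the predicate locally.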

A self-stabilizing phase synchronization protocol converges to a stable
Synchronized predicate (recall that we consider a synchronous model of
computation for this first algorithm):

\[
Synchronized \equiv (\forall v :: clock_v \in R) \wedge (\forall v, w :: clock_v = clock_w)
\]

Additionally, a phase synchronization protocol satisfies this liveness property:
the value of $clock_v$ increments in each step of the computation if $R$ is
infinite, and increments modulo $m$ if $R$ is finite.

Our traversal algorithm is the parallel composition 0 of stabilizing coloring, 
stabilizing phase synchronization, and a swapping protocol. In the case where the 
phase clock is unbounded, the swapping protocol is given by the single guarded
action (per node)

\[
(clock_v \bmod \Delta) = color_v[w] \;\rightarrow\; \text{swap agent at } v \text{ with agent at } w
\]

If the phase clock has finite domain $[0, m - 1]$ for $m > \Delta$, then there
exists $g$, the largest integer that is a multiple of $\Delta$ and at most $m$,
for which the swapping protocol is given by

\[
(clock_v < g) \wedge (clock_v \bmod \Delta) = color_v[w] \;\rightarrow\; \text{swap agent at } v \text{ with agent at } w
\]

At each time unit, a node either initiates an agent swap or the node does nothing
more than increment its phase clock.
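
To see the swapping protocol in action, the following Python sketch simulates
it on a small tree. It assumes the system has already stabilized, so that
Colored and Synchronized hold and a single shared `clock` can stand in for all
node clocks; the coloring and phase-clock protocols themselves are not modeled.
The names `run_swapping` and `edge_color` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def run_swapping(n: int,
                 edge_color: Dict[Tuple[int, int], int],
                 delta: int,
                 steps: int,
                 m: Optional[int] = None) -> Tuple[List[int], List[Set[int]]]:
    """Simulate synchronous steps; return agent positions and visited sets."""
    position = list(range(n))           # position[v] = agent located at node v
    visited = [{a} for a in range(n)]   # visited[a] = nodes seen by agent a
    g = None if m is None else (m // delta) * delta  # largest multiple <= m
    clock = 0
    for _ in range(steps):
        if g is None or clock < g:
            phase = clock % delta
            # A proper coloring makes same-colored edges disjoint, so doing
            # the swaps one after another equals doing them in parallel.
            for (v, w), c in edge_color.items():
                if c == phase:
                    position[v], position[w] = position[w], position[v]
                    visited[position[v]].add(v)
                    visited[position[w]].add(w)
        clock = clock + 1 if m is None else (clock + 1) % m
    return position, visited


# A line of four nodes colored with delta = 2 colors.
line_colors = {(0, 1): 0, (1, 2): 1, (2, 3): 0}
final_position, seen = run_swapping(4, line_colors, delta=2, steps=8)
```

On this line every agent has visited every node after 8 steps, which is within
the $2(n - 1)\Delta = 12$ step bound derived later in this section, and each
agent is then back at its starting node.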

Lemma 2. Let the communication graph be a tree network and let $\sigma$ be a
system state satisfying the Colored and Synchronized predicates. Let $A$ be an
agent located at some node $v$ such that $color_v[w] = clock_v$ at state
$\sigma$. Let $T$ be the subtree of the communication graph formed from the union
of all paths that contain $w$ and have $v$ as one endpoint of the path. Then, in
any subsequent computation, agent $A$ crosses over each edge of $T$ exactly twice
before returning to $v$.

Proof. Note first that, from any state satisfying Colored and Synchronized,
every agent eventually moves via an agent swap operation. This follows from the
liveness property of the phase clock and the edge coloring, which ensures that
every color enables a swap infinitely often in a computation. The remainder of
the proof is by induction on the size of $T$. The first swap at $v$ following
state $\sigma$ sends agent $A$ to node $w$. If $w$ is a leaf of $T$, then the
next swap by $w$ returns $A$ to $v$ and proves the claim. Otherwise, by
hypothesis, agent $A$ completes a tour of one subtree rooted at $w$ (that
doesn't include $v$) and returns to $w$; the return to $w$ also advances the
phase clock to the next color, and thus selects another subtree. Because the
ordering of colors and edge coloring is fixed, only after visiting all subtrees
of $w$ (and visiting none of these more than once) does agent $A$'s next swap
return it to $v$, which completes the proof. □



Theorem 1. The composition of a stabilizing coloring algorithm, a stabilizing 
phase clock, and the agent swapping protocol is a stabilizing solution to the 
n-agent traversal problem for tree networks. 

Proof. The stabilization of the coloring and phase clock are given. The previous 
lemma shows that a depth-first search order of agent circulation holds for a sub- 
tree; the same argument applies in succession to all subtrees, which demonstrates 
agent traversal. □ 



Lemma 3. Starting from a state satisfying Colored and Synchronized, all agents
traverse the tree network within $O(\Delta n)$ time.

Proof. We show the proof for a particular case, that is, where the clock modulus
$m$ satisfies $m \bmod \Delta = 0$ and where the initial state has $clock_v = 0$
for every $v \in V$ (other cases have similar arguments). In the following
$\Delta$ steps in a computation from such an initial state, there is exactly one
swap for each edge of the network, thus there are $n - 1$ swaps in $\Delta$
steps. Because each agent moves only in a DFS tour, each agent uses exactly
$2(n - 1)$ swaps to complete its tour. All agents move in parallel and each agent
moves at least once in each sequence of $\Delta$ steps. So after
$2(n - 1)\Delta$ steps, every agent has completed traversal. □

Lemma 3 shows that the agent traversal algorithm is asymptotically optimal,
since the traversal time matches the lower bound. Moreover, by examining the
algorithm with much care, we can show that all agents traverse the tree network
in exactly $\Delta n$ time, which exactly matches the lower bound. The asymptotic
optimality also holds for a suboptimal coloring, provided the number of colors is
$O(\Delta)$.

Lemma 4. The composition of a stabilizing coloring algorithm, a stabilizing
phase clock, and the agent swapping protocol is a stabilizing solution to the
n-agent traversal problem for ring networks.

Proof. First we establish that each agent in the ring has a fixed orientation
(clockwise or counterclockwise) throughout any computation. Since, by the
swapping protocol, all colors are selected for swaps infinitely often in any
computation, the result follows. By properties of the phase clock and the Colored
predicate, there occurs a swap along each edge of the communication graph
infinitely often in any computation. Therefore, we may consider an arbitrary
agent $A$ and the first time it swaps with some neighboring agent. This event
occurs on an edge $(v, w)$ with some color $r$, moving $A$ from $v$ to $w$. In
the ring, there are two edges incident on $w$, namely $(v, w)$ and $(w, x)$,
where $(w, x)$ has color $s \neq r$. The next swap event that involves $A$ is
along $(w, x)$, sending $A$ to $x$, since the phase clock's activation of color
$s$ occurs before color $r$, because colors are selected cyclically and $r$ was
the last color selection. This establishes that agent $A$ has the orientation
(thus far in its movement) of “$v$ to $w$ to $x$.” By an inductive argument it
follows that agent $A$ retains the same orientation continuously around the ring.
□



To establish asymptotically optimal efficiency of the swapping protocol for 
the ring, we require that the coloring algorithm use only a constant number of 
colors, since this implies that each edge is engaged in a swap within a constant 
number of steps, so that each agent completes traversal of the ring within $O(n)$
steps.

That the agent traversal algorithm fails to solve n-agent traversal for general
communication graphs is shown by the counter-example of Figure 1. The colors
0, 1, and 2 are shown on the edges of this 4-vertex graph. Agent a is located
centrally in the figure, and clocks start at zero. From this initial state, agent
a perpetually swaps along the graph’s cycle, but fails to visit the leaf node.




Fig. 1. Counter example for agent traversal. 



The algorithm does solve n-agent traversal on graphs that are hybrids of
trees and a ring; in fact, it is not difficult to show there exists a coloring
(generally far from optimum) for graphs with an Euler walk, so that agent swaps
are sequenced to move agents cyclically over all edges of the communication
graph.

4.2 Adaptation to Asynchronous Computation



The synchronous algorithm can be adapted to the asynchronous case by using a
stabilizing asynchronous phase clock and a stabilizing asynchronous edge coloring
algorithm. A technical challenge for such an adaptation is that neighboring
clocks can differ in the asynchronous case, so extra measures are needed to
prevent conflict in the scheduling of agent swaps. Recall that an invariant for
an asynchronous phase clock is $(\forall v :: legitClock_v)$ where

\[
legitClock_v \equiv (\forall w : w \in N_v : (clock_v - clock_w) \bmod m \leq 1 \;\vee\; (clock_w - clock_v) \bmod m \leq 1)
\]

for some constant $m$. Informally, the invariant is that neighboring clock
variables
differ by at most one (the protocol of has a more complicated definition of a 
legitimate state because variables other than clock are introduced; however for 
our purposes, legitClock suffices). The condition for a clock to increment
(modulo some constant) is that its value be “minimal” with respect to all
neighboring node clock variables, where minimal is with respect to a cyclic
ordering $\prec$
of the range of the clock modulus (called the “beh” relation in [bj). 
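
The informal reading of the clock invariant, that neighboring clocks differ by
at most one modulo $m$, can be checked with a few lines of Python. The function
name `legit_clock` and the dictionary representation are assumptions made for
this sketch; the additional variables of the underlying phase clock protocol
are not modeled.

```python
from typing import Dict, List


def legit_clock(v: int,
                neighbors: Dict[int, List[int]],
                clock: Dict[int, int],
                m: int) -> bool:
    """True iff v's clock is within one (mod m) of every neighbor's clock."""
    return all(
        (clock[v] - clock[w]) % m <= 1 or (clock[w] - clock[v]) % m <= 1
        for w in neighbors[v]
    )


line = {0: [1], 1: [0, 2], 2: [1]}
wrapped = {0: 7, 1: 0, 2: 1}   # 7 and 0 are adjacent in the cyclic order mod 8
gapped = {0: 0, 1: 2, 2: 2}    # nodes 0 and 1 are two phases apart
```

Python's `%` operator returns a non-negative result for a positive modulus, so
the wrap-around from `m - 1` to `0` is handled without special cases.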

We suggest the following modification to the phase clock protocol. The guard
for advancing the phase clock at $v$ should be strengthened in the locally
legitimate case, that is, when $clock_v$ is minimal for its neighborhood (to
simplify presentation, we are considering the case $m \bmod \Delta = 0$). This
advance of the phase should be prevented in the case where two neighboring clock
values are equal and the edge color for these neighbors indicates a swap; then
the value of both clocks should increment in parallel, accompanied by an agent
swap, in one step of the computation. Figure 2 illustrates the situation for a
network of five nodes. Edges in the figure are labeled with colors in the domain
$[0, 3]$ and the ordering of clock values satisfies
$0 \prec 1 \prec 2 \cdots \prec (m - 1) \prec 0$. Nodes are labeled by their
clock values. In this state, the leaf node with clock = 1 would, in the usual
asynchronous phase clock protocol, be enabled to increment (modulo $m$) because
its clock is locally minimal. We have proposed that a swap should accompany such
an increment since the edge color is equal to the phase number; however, this
cannot occur unless both nodes incident on the edge simultaneously advance their
clock variables, and local minimality does not hold for the figure’s central
node.

To implement our proposal, let each node $v$ have a new boolean variable $b_v$.
The protocol for $v$ has a new action

\[
\neg b_v \wedge (\forall w : w \in N_v : clock_v \preceq clock_w) \;\rightarrow\; b_v := true
\]

and any change to $clock_v$ is modified to also falsify $b_v$. With these
modifications, when $b_v$ is true in a legitimate state, then $v$ has a locally
minimal clock. To define legitimacy, we consider both clock and color values. A
legitimate state is



Ted Herman and Toshimitsu Masuzawa 



one satisfying $(\forall v :: legit_v)$ where

\[
\begin{aligned}
legit_v \equiv\ & legitClock_v \\
\wedge\ & (b_v \Rightarrow (\forall w : w \in N_v : clock_v \preceq clock_w)) \\
\wedge\ & (\forall w : w \in N_v : (clock_v \bmod \Delta) = color_v[w] \Rightarrow clock_w \preceq clock_v)
\end{aligned}
\]

The last conjunct of $legit_v$ is needed to specify that when $v$'s clock
advances to equal (mod $\Delta$) the color of an incident edge $(v, w)$, it
should not be the case that $clock_v \prec clock_w$, since deadlock would result
by $v$ waiting for $w$ to advance its phase “backward” to equal the $(v, w)$ edge
color. Notice that the definition of $legit_v$ has been engineered to be locally
testable.

Node v’s swap protocol combined with the phase clock protocol has the fol- 
lowing form (omitted are corrective actions for illegitimate states jH| and the 
assignments to by mentioned above). 

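\[
\begin{array}{l}
(\forall w : w \in N_v : \neg legit_v \vee (clock_v \bmod \Delta) \neq color_v[w]) \;\rightarrow\; \text{use normal phase clock protocol} \\
[\,]\ ([\,]\, w : w \in N_v : legit_v \wedge (clock_v \bmod \Delta) = color_v[w] \wedge b_v \wedge b_w \;\rightarrow \\
\qquad \text{advance } clock_v \,;\ \text{swap agents with node } w \,;\ b_v := false\,)
\end{array}
\]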

Recall that synchronously parallel assignment to variables of two neighboring 
nodes is implicit in our model, which enables agent swap as a fundamental 
primitive (even in an asynchronous computation). 




Fig. 2. Example of asynchronous phase clock. 



5 Non-phased Algorithm 

The agent traversal algorithm of Section 4 relies on coloring and phase control
to correctly swap agents. This section introduces a very simple protocol
that requires no such machinery. In fact, the protocol presented below has no 
illegitimate states (it is snap-stabilizing |210j) and is suited to both synchronous 
and asynchronous models of computation. This simple protocol does not apply 
to as many topologies as the protocol of Section 4; it is restricted to tree
networks.

5.1 Presentation and Verification

Figure 3 presents the new protocol. Only one non-agent variable is used by each
node $v$, $swapnext_v$, which is the index with respect to $N_v$ of one neighbor.
Intuitively, $swapnext_v$ specifies which neighbor $v$ should next engage in an
agent swap operation, as soon as that neighbor agrees to the swap. Recall that a
swap is an atomic step changing the state of two nodes. When node $v$ swaps
agents with neighbor $w$, then both $swapnext_v$ and $swapnext_w$ increment,
modulo the size of their respective neighborhoods, in the same step. Node $v$ can
be viewed as “waiting” at states where $N_v[swapnext_v] = w$, but
$N_w[swapnext_w] \neq v$. This possibility of waiting motivates the following
lemma to prove absence of deadlock (however, deadlock can occur for a cyclic
communication graph).



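\[
\begin{array}{l}
N_v[swapnext_v] = w \;\wedge\; N_w[swapnext_w] = v \;\rightarrow \\
\qquad \text{swap agent at } v \text{ with agent at } w \,; \\
\qquad swapnext_v := (swapnext_v + 1) \bmod \delta_v
\end{array}
\]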



Fig. 3. Agent traversal protocol for node v. 
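
The protocol of Figure 3 is small enough to simulate directly. The Python
sketch below is an illustration under stated assumptions: nodes are integers,
`neighbors[v]` is the fixed ordering of $N_v$, and a synchronous step executes
every enabled swap at once. Both endpoints advance their `swapnext` index in the
same step, as the text describes. The function names and the example tree `STAR`
are hypothetical.

```python
from itertools import product
from typing import Dict, List, Set, Tuple


def enabled_swaps(neighbors: Dict[int, List[int]],
                  swapnext: Dict[int, int]) -> List[Tuple[int, int]]:
    """Pairs (v, w) that currently point at each other."""
    pairs = []
    for v, nbrs in neighbors.items():
        w = nbrs[swapnext[v]]
        if v < w and neighbors[w][swapnext[w]] == v:
            pairs.append((v, w))
    return pairs


def synchronous_step(neighbors, swapnext, position) -> None:
    """Execute all enabled swaps; they are disjoint, so order is irrelevant."""
    for v, w in enabled_swaps(neighbors, swapnext):
        position[v], position[w] = position[w], position[v]
        swapnext[v] = (swapnext[v] + 1) % len(neighbors[v])
        swapnext[w] = (swapnext[w] + 1) % len(neighbors[w])


def run(neighbors: Dict[int, List[int]], steps: int) -> Dict[int, Set[int]]:
    """Start with agent v at node v and all indices 0; return visited sets."""
    nodes = sorted(neighbors)
    swapnext = {v: 0 for v in nodes}
    position = {v: v for v in nodes}
    visited = {a: {a} for a in nodes}
    for _ in range(steps):
        synchronous_step(neighbors, swapnext, position)
        for v in nodes:
            visited[position[v]].add(v)
    return visited


def some_swap_always_enabled(neighbors: Dict[int, List[int]]) -> bool:
    """Check every assignment of swapnext values for an enabled swap."""
    nodes = sorted(neighbors)
    ranges = [range(len(neighbors[v])) for v in nodes]
    return all(
        enabled_swaps(neighbors, dict(zip(nodes, choice)))
        for choice in product(*ranges)
    )


STAR = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}   # node 1 is the center
```

On small trees, `some_swap_always_enabled` exhaustively checks the
deadlock-freedom property that the next lemma proves in general, and `run` shows
every agent reaching every node of the star within a few steps.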



Lemma 5. In any state, at least one agent swap is enabled for the protocol of
Figure 3.

Proof. We define, for any state $\sigma$, a directed graph $G = (V, E)$ as
follows. The set of vertices $V$ is the set of nodes in the tree; edge
$(v, w) \in E$ iff $w = N_v[swapnext_v]$. Observe that $G$ contains a cycle
$\{(v, w), (w, v)\}$ iff an agent swap between $v$ and $w$ is enabled at
$\sigma$. The proof consists of showing the existence of such a cycle. To show
that $G$ has at least one cycle, we use a path argument. For any leaf node $v$ of
the tree, there is an edge $(v, w_1)$ in $E$ because the domain of $swapnext_v$
is the singleton set $\{0\}$. If $(w_1, v) \in E$ then there is a cycle and the
lemma holds; otherwise there is some edge $(w_1, w_2)$. Now we repeat this case
analysis inductively, either finding a cycle or constructing a directed path
$w_1, w_2, \ldots, w_k$ such that $w_k$ is a leaf vertex in the tree, which then
implies a cycle because $(w_k, w_{k-1}) \in E$. □

Lemma 6. Each node participates in an agent swap infinitely often in any com- 
putation. 

Proof. Suppose, heading for a contradiction, that some node participates only
finitely often in some computation. Then there exist $x, y \in V$ such that $y$
participates in infinitely many swaps (by weak fairness and Lemma 5) and $x$ does
not, but $x \in N_y$. However, because $y$ increments $swapnext_y$ to eventually
satisfy $N_y[swapnext_y] = x$ following any state of the computation, it follows
that $x$ thereafter swaps with $y$ because no further step involving $y$ can
occur without a swap with $x$. □

A corollary of Lemma 6 is that each agent is swapped infinitely often. To
show other properties of the protocol, we borrow concepts from previous papers
that implement self-stabilizing construction of a depth-first search tree. Given
a rooted tree network with a fixed neighborhood enumeration (such as $N_v$) and
distinct node identifiers, there is a lexicographic ordering of paths from the
root to every other node in the tree.

Lemma 7. The sequence of nodes visited by any agent is a DFS traversal of 
the network. 

Proof. Let $r$ be an arbitrary node hosting an agent $A$ in the initial state,
which we consider to be the root for purposes of the proof. To fix the order of
paths with respect to $r$, we show first by induction that agent $A$ moves along
an initial path from $r$ to some leaf. For the base case, if $r$ has a neighbor
$v$ and the first swap of $A$ involves $(r, v)$, then the initial path is
established. Otherwise, $A$ moves to some $w \in N_r$ so that
$N_w[swapnext_w] \neq r$. Applying the argument inductively, $A$ continues to
move along some path, but the tree topology prevents the formation of a cycle,
and because the number of nodes is finite, $A$ arrives at some leaf node $x$
after at most $n - 1$ swaps involving agent $A$. The second part of the proof
exploits the fixed neighborhood enumeration. Suppose, inductively, that agent $A$
has visited some proper subset $W \subset V$ in DFS order with respect to root
$r$ with $x$ as the first leaf visited. There are two cases for the last swap in
such a visitation sequence. If the last swap places $A$ at some node $y$ it
visited previously, then the backtrack swap moving $A$ to $y$ either increments
$swapnext_y$ to the next node in the DFS order (if such a previously unvisited
node exists) or the increment of $swapnext_y$ refers to the first neighbor of $y$
upon which $A$ arrived, that is, the backtrack neighbor. This verifies the
inductive hypothesis.
□

Theorem 2. The protocol of Figure 3 is a stabilizing solution to the n-agent
traversal problem.

Proof. Lemma 7 makes no particular assumption about the initial state, hence
every computation moves all agents in DFS order, which is a solution to the
n-agent traversal. □

5.2 Complexity for a Line 

The results of this subsection are for a line of $n$ processes. In this linear
topology, $\delta_i = 2$ for all but the endpoints of the line, which simplifies
the analysis. For any given state of the linear network, an agent $A$ has an
orientation, right or left, denoted $\overrightarrow{A}$ or $\overleftarrow{A}$.
If $A$ and $B$ are neighboring nodes with $A$ to the left of $B$, then one of the
four possibilities of their agents' orientation is
$\overrightarrow{A}\,\overleftarrow{B}$, which we denote by
$A \rightleftarrows B$.

Lemma 8. Agents swap in the direction of orientation; agents retain one
orientation until swapped to an endpoint of the line; the orientation reverses at
the endpoint of the line.

Proof. These simple properties result from the fact that each swap reverses the
orientations at the pair of nodes involved, but a precondition for a swap is the
$\rightleftarrows$ condition (the two agents face each other). □



Lemma 9. If A and B have the same orientation and A is nearer to its target 
endpoint than B, then A continues to be nearer to its target endpoint so long 
as A and B have the same orientation. 



Proof. Up to when one of $A$ or $B$ reaches the endpoint node, it suffices to
consider the case when the two are neighbors, since they would have to be
neighbors as a precondition for one overtaking the other. However, if $A$ and $B$
are neighbors, preconditions $A \rightleftarrows B$ or $B \rightleftarrows A$ are
not possible, by Lemma 8. □

It is possible to extend Lemma 9 to show that agents arrive at nodes in
deterministic order, that is, from an initial state $\sigma$, if the order of
agent arrival at some node $v$ is $\mathcal{A} = A_1, A_2, A_3, \ldots$ for one
computation, then the same order $\mathcal{A}$ of agent arrival holds for every
computation beginning from $\sigma$.

Let a sequence of consecutive nodes in the line be called an orientation chain
if all their agents have the same orientation. For any given state the
orientation chains are described by a tuple of the form
$(r_1, \ell_1, r_2, \ell_2, \ldots, r_k, \ell_k)$, where each $r_i$ term
specifies the number of nodes in a $\rightarrow$ chain and $\ell_i$ specifies the
number of nodes in the following $\leftarrow$ chain (the endpoint orientations
are constant). The short chain predicate $Sc$ is defined to hold iff each term of
the orientation tuple is 1, with possible exception of the endpoint terms, which
are either 1 or 2.

Lemma 10. In any synchronous computation, the protocol stabilizes to $Sc$
within $O(n)$ steps.

Proof. (Sketch.) It can be shown, for a synchronous computation, that with the
exception of merging endpoint chains of length 1, no chain increases length by
any step (the proof is by induction, using the fact that only swaps can modify
chain lengths, and every chain terminates, following its orientation, in a
$\rightleftarrows$ condition). An example of this exception is the following
step: $(1, 3, 1, 1)$ becomes $(2, 4)$, which is depicted as follows:



Endpoint chains make the exception because orientations at the end of a chain
do not reverse by a swap operation. For chains not subject to this exception,
it remains only to show that chains of length $t > 2$ shorten in a computation.
Consider a chain of length $t$ and two subsequent steps of the computation. Since
the chain terminates at a $\rightleftarrows$ pair, it will shorten by one node as
the result of this swap; however, the chain could also lengthen because some swap
adjacent to the opposite end adds a new node. Thus in one step, the chain either
retains its length or shortens. In the case where the length is unchanged, let us
suppose the chain lengthens by a new node to the left and diminishes from the
right. This implies that the sum of terms occurring before the chain’s terms in
the orientation tuple decreases. Therefore, after at most $O(n)$ steps, the chain
must decrease in length. Since these observations hold in parallel for all
chains, the protocol stabilizes to $Sc$ within $O(n)$ steps. □

Lemma 11. In any synchronous computation, every agent traverses the network
within $O(n)$ steps.

Proof. Once Sc holds, there occur at least n/4 swaps per step. Each agent there- 
fore advances in the DFS visitation at least once every four steps, so 4n steps 
suffice to guarantee n-agent traversal. □ 

The more difficult case is asynchronous computation. To prepare for the
analysis, we propose an ordering on the swap operations in a computation. Let
$\sigma$ be an initial state and construct an infinite directed graph, which has
vertices taken from the states of a synchronous computation. More precisely, the
graph is constructed inductively from an initially empty set of vertices, and
then adding the $n$ vertices representing the nodes of the network for each step
of the synchronous computation. In this construction, we label each vertex as
$v : i$ by the name of its network node $v$ and the step number $i$ from which it
derives. For each step from time $i$ to time $i + 1$, add an edge
$(v : i, w : i + 1)$ to the graph iff an agent moves from node $v$ at time $i$ to
node $w$ at time $i + 1$. By the arguments
of Lemma CH the number of edges added for each step eventually becomes at 
least n/2. By extension of Lemma 9, the order defined by this graph is the same
dependency order for agent swaps in any computation, including asynchronous 
ones. 

Now consider an asynchronous computation, and the same graph construc- 
tion to model the computation as in the previous paragraph. Here, in addition 
to labeling vertices by node and step, we may add a third label to specify the 
round number, so each vertex has a label of the form v : i : j. The step number i 
refers to the step number of the synchronous computation where this swap would 
occur rather than the step number in the asynchronous computation. Although 
our computational model is not a message-based one, we can use the notion of a 
consistent cut to relate states from the asynchronous graph to the synchronous 
one. A cut is a subgraph that causally includes, for each of its vertices other 
than ones from the initial state, all predecessor edges and vertices. 

Lemma 12. Let $v : i$ be a vertex with maximum value $i$ in a cut and $w : k$ be
a vertex with minimum value $k$ in the same cut; then $i - k < n$.

Proof. (Sketch.) An asynchronous computation allows some swaps to occur in 
different order from the synchronous computation, but since agents move in DFS 
order and the graph dependencies for swaps are the same for any computation, 
dependencies “fan out” to include all n nodes if one agent swaps n — 2 times in 
some period while other agents do not swap. □ 

Lemma 13. In any asynchronous computation, every agent traverses the network
within $O(n)$ rounds.

Proof. (Sketch.) In each round, every agent located at a vertex of the form
$w : k$ where $k$ is minimal completes a swap because all causal dependencies are
satisfied. The total “stretch” for such dependencies is $n$ by Lemma 12, hence
any asynchronous computation’s maximum cut is at most $O(n)$ swaps behind some
synchronous computation with the comparable time. □

6 Conclusion 

This paper defines and investigates the problem of n-agent traversal in a model 
where only one agent per node is permitted in any state. The basic operation 
for agent movement is the agent swap, which differs from previous research in 
self-stabilization because the swap operation atomically changes the state of two 
processes. Although this model of mutual engagement differs from classical work 
in the area, the algorithmic techniques we used are widely used. Readers familiar 
with central daemon emulations will recognize one of the basic requirements of 
the model: two simultaneous swaps are not allowed if they are incident on a com- 
mon node (in the central daemon emulation, neighbors are not allowed to execute 
actions in parallel). It is therefore not surprising to find self-stabilizing daemon
emulation algorithms similar to the ones in this paper [?].

The line topology protocol presented earlier in this paper resembles algorithms given
in [?]. We conjecture that our analysis of traversal time can be applied
to these daemon emulators. The combination of “maximal concurrency” com- 
plexity m and the causal dependency of the algorithm could give insight for 
measuring round-based complexity of these emulators.

Abstract. A data structure is stabilizing if, for any arbitrary (and possibly
illegitimate) initial state, any sequence of sufficiently many operations
brings the data structure to a legitimate state. A data structure is avail- 
able if, for any arbitrary state, the effect of any operation on the structure 
is consistent with the operation’s response. This paper presents an avail- 
able stabilizing data structure made from two constituents, a heap and 
a search tree. These constituents are themselves available and stabilizing 
data structures described in previous papers. Each item of the composite 
data structure is a pair (key, value), which allows items to be removed 
by either minimum value (via the heap) or by key (via the search tree) 
in logarithmic time. This is the first research to address the problem of 
constructing larger data structures from smaller ones that have desired 
availability and stabilization properties. 



1 Introduction 

Availability is an important topic in online system design. Ideally, a system 
should respond to requests in a timely manner in spite of hardware failures, 
bursts in load, internal reconfigurations, and other disruptive factors. The usual 
technique for ensuring availability is to engineer a system with sufficient
redundancy to overcome failures and resource shortages [?].

One attraction of self-stabilization is that it does not require the traditional 
type of resource redundancy to deal with faults. The question then comes to 
mind, can self-stabilization be enhanced to support system availability? We be- 
gin to address this question with a low-level task, which is to consider data 
structures. Since data structures are frequently used in software for systems, the 
key question is: can data structures that support availability in spite of tran- 
sient failures be constructed? The answer is not obvious, since most operations 
on data structures either abort, get caught in a loop, throw exceptions, or have 
unpredictable behavior when internal variables have invalid values. Specifically, 
we study one data structure in this paper, a composite data structure made 
from a heap and a search tree. This data structure is of interest because it shows 
how one available data structure can be constructed from smaller, available data 
structures (of course, general compositional methods are the ultimate goal, but 
examples are helpful to understand the technical difficulties). 

* This research is sponsored by NSF award CAREER 97-9953 and DARPA
contract F33615-01-C-1901.

The literature of self-stabilization differs from our treatment in several ways.
First, we do not consider a distributed system, which is the normal model for
self-stabilization [?]. Second, most self-stabilizing algorithms make no
guarantees about behavior before the system reaches a legitimate state, whereas
we require some guarantees for availability. Third, the data structure model of
operations in this paper restricts transient faults to affect only the variables of
the data structure and not the internal variables of an operation underway (in
principle, our results could be extended to allow such corruption as well). Our 
data structure results are stabilizing in the following sense. Initially, the content 
of a data structure may be arbitrary and corrupt. During some initial sequence 
of operations on the data structure, the response time of each of these opera- 
tions could be abnormally large, but not larger than that for a legitimately full 
data structure; and some operations during this sequence could respond with 
errors, such as reporting that an item could not be inserted, even though the 
data structure is not full. In all cases, however, the response is consistent with 
how the operation changed the data structure. Finally, after a sufficiently long 
sequence of operations, the data structure’s state is legitimate, and thereafter 
all operations have normal running times and responses. 

While there are numerous studies of fault-tolerant data structures, the fault 
model for these studies does not consider recovery from unlimited transient faults 
in the data structure. Self-stabilization is required to deal with unlimited
transient faults, yet very few papers in the area of self-stabilization treat data
structures in the model of operations applied to the structures. Two works [?]
mention self-stabilization in the context of objects that undergo operations before a
legitimate state is reached. The challenge of [?] is that operations are concurrent
and wait-free, which is an issue beyond the scope of our present investigation.
The notion of an available and stabilizing data structure is new in [?], which
presents an available, stabilizing, binary heap. An available and stabilizing 2-3
tree is described in [?].

The data structures investigated in this paper and related papers [?] are
fixed-capacity structures. The reason for fixing the capacity is to include the 
design of dynamic storage allocation in the implementation. Standard texts pre- 
senting data structures may gloss over the role of allocation in dynamic struc- 
tures, but for the case of stabilization, an incorrect initial state can make the 
storage allocation pool appear empty even though the data structure contains 
very few items. The situation is even more complex when numerous data struc- 
tures share a common allocation pool. At this stage of research, we investigate 
single data structures, hence the constraint of fixed capacity to make the pre- 
sentation self-contained. Also, the data structure operations considered here are 
single-item operations (insert, delete, find); operations such as set intersection 
or union are not investigated. Many research directions thus remain unexplored, 
including inter-structure operations, data structures of unlimited capacity, and 
shared allocation pools (and of course, the question of concurrent operations). 

Organization of Paper. Section 2 describes the model of data structures and
operations, and then defines availability and stabilization precisely. Section 3
presents a simple example of a data structure and briefly reviews the
constructions for heap and search tree. Section 4 defines the main problem we consider,
and sketches an impossibility result related to this problem. Section 5 gives an
overview of the main construction. Detailed correctness proofs of the
construction are omitted from this extended abstract. The paper ends with a discussion
in the final section.



2 Availability, Stabilization, and Operations 

A data structure is an object containing items. The object is manipulated by a 
fixed set of operations. Each type of operation is defined by a signature (opera- 
tion name and invocation parameters) and a set of responses for that signature. 
The relation between a signature and its response is specified by sets of sequen- 
tial histories of operations applied to the data structure. That is, we specify 
operation semantics by a collection of legal sequences rather than by present- 
ing pseudocode. We do this in order to maximize the freedom of the object’s 
implementor to choose data representation and algorithms. 

A history is an infinite sequence of pairs $((op_1, res_1), (op_2, res_2), \ldots)$ where
$op_i$ is the $i$-th operation invocation (including its parameters) and $res_i$ is the
response to $op_i$. A point in a history refers to the state of the data structure either
before any operation or between two operations $op_i$ and $op_{i+1}$. The content of
the data structure at a point t is defined directly if t is the initial point of the
history; if t is not the initial point, then the content is defined in terms of the
sequence of operations and responses leading up to point t. Let $C_t$ denote the
content at point t. Let $|C_t|$ denote the number of items in the data structure at
point t. All data structures have a fixed capacity, which is an upper bound on
the number of items the data structure is allowed to contain.

We illustrate content specification by history with a small example, a data
structure for a set of at most K items with insert and remove operations. Let
the content of the set be empty at the initial point of any history. The content $C_t$
at point t following operation $op_t$ is defined recursively: if $op_t$ is an insert(x)
and $x \notin C_{t-1} \wedge |C_{t-1}| < K$, then $C_t = C_{t-1} \cup \{x\}$ and the response to the
operation is ack; if $op_t$ is an insert(x) and $|C_{t-1}| = K \vee x \in C_{t-1}$, then
$C_t = C_{t-1}$ and the response to the operation is full if $|C_{t-1}| = K$ or ack otherwise.
If $op_t$ is a remove(x), then $C_t = C_{t-1} \setminus \{x\}$ and the response is ack.
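
The recursive definition above can be read as a small reference model. The
following Python sketch is only an illustration of the specification, not an
implementation from the paper; the names step and run and the capacity value
K = 3 are assumptions chosen for the example.

```python
K = 3  # capacity of the set; an arbitrary value assumed for illustration


def step(content, op, x):
    """Map the content C_{t-1} and one operation to the pair (C_t, response)."""
    if op == "insert":
        if x not in content and len(content) < K:
            return content | {x}, "ack"
        # Moot insert: the set is full or x is already present.
        return content, ("full" if len(content) == K else "ack")
    if op == "remove":
        return content - {x}, "ack"
    raise ValueError(f"unknown operation: {op}")


def run(ops):
    """Replay a finite history prefix starting from the empty set."""
    content = frozenset()
    responses = []
    for op, x in ops:
        content, res = step(content, op, x)
        responses.append(res)
    return content, responses
```

In this model an insert of an element that is already present is moot but
still answers ack, which matches the remark that moot operations need not be
unsuccessful.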

Throughout the paper we assume the following for data structures and their
operations. Operations are single object methods and we suppose that all
operations in a history are operations on the same data structure (so the example
above cannot be extended to allow set union and intersection operations, which
combine objects). An operation $op_t$ is called moot if $C_t = C_{t-1}$. We classify each
operation $op_t$ as either successful or unsuccessful depending on its response, and
this classification is specified as part of the suite of operations for a data
structure. In this classification, all unsuccessful operations are moot, but the converse
need not hold.

The intuition for “unsuccessful” classification is that, although the operation
is guaranteed to be moot, the response of the operation may not be trustworthy 
if the history begins with a data structure damaged by a transient fault. We 
formalize this below in the definitions of availability and stabilization. For the 
set example above, only the response full could be classified as an unsuccess- 
ful operation; if full is an unsuccessful response, then intuitively, in a damaged 
state, an insert operation that responds full is allowed to do so even though 
the set contains fewer than K items. 

A well-formed history is one where the responses of all operations agree with 
the definition of content for the data structure. A detailed specification of what
it means for a history to be well-formed depends on the semantics for the data
structure operations, their responses, and the definition of content. A legitimate
history is a well-formed history in which the operation running times are within
the bounds, as a function of content, given by the implementation for that data
structure. For example, suppose the set data structure has an implementation
where the running time for an operation on a set of size m is linear in m. An
example of an illegitimate history is one where an insert(x) operation fails
at some point t even though $|C_t| < K$. Another example of illegitimacy is a
remove(x) operation that has $2^m$ running time at point t although $|C_t| = m$.
Observe that if all moot operations are removed from a legitimate history, the 
result is either a legitimate history or a (finite) prefix of a legitimate history. 

Given history H, let $H_t$ denote the suffix of H following point t. Let $U(I)$,
for a history I or segment I of a history, be the sequence obtained by
removing all unsuccessful operations from I. A history H is available if there exists
a point t and a sequence of operation invocations P such that $P \circ U(H_t)$ is a
well-formed history or a (possibly empty) finite segment of a well-formed history
($\circ$ is the sequence catenation operator); also, the running time of any operation
in H is no more than the worst-case running time of any operation, taken over
all legitimate histories (usually this worst-case running time is obtained for an
operation on a full data structure). Because unsuccessful operations are moot,
it follows that legitimate histories are available histories. Examples of histories
that are available but not legitimate include histories with operations whose
running times are larger than one would expect for the content, and operations that
return unsuccessful responses in unexpected cases (e.g., an insert(x) responds
full even though the content has fewer items than the data structure’s capacity).
An implementation of a data structure is available iff all its histories
are available. A trivial implementation of a set to guarantee availability would
be to have any remove do nothing to the data structure and respond ack in O(1)
time, and to have any insert do nothing to the data structure and respond full
in O(1) time. To see that this implementation is available, take any history H,
let t be the initial point of H, let P be empty, and choose $C_t$ to be the empty
set. Then $P \circ U(H_t)$ is a legitimate history since it has no insert operations
and remove operations are moot.

A history H is stabilizing if there exists a point t and a sequence of operation
invocations P such that $P \circ H_t$ is a legitimate history. In the definition of a
stabilizing history, there can be many choices for the point t and prefix P to
make $P \circ H_t$ a legitimate history; call t the stabilization point if there exists no
point s preceding t such that $Q \circ H_s$ is a legitimate history for any prefix Q. An
implementation of a data structure is stabilizing iff all its histories are stabiliz- 
ing. The stabilization time of a stabilizing history H is the number of operations 
preceding its stabilization point. The stabilization time of an implementation of 
a data structure is the maximum stabilization time of all of its histories (if there 
is no maximum, then the stabilization time is infinite). 



3 Available and Stabilizing Components 

Before presentation of the composite data structure in Section 4, we review first
the constituent data structures used to build the composite. Some constituents 
are trivially available and stabilizing: for instance, we may regard an atomic 
variable (word of memory) to be a data structure supporting read and write 
operations. It is simple to show that such variables are available and stabiliz- 
ing. Interesting data structures such as heaps and search trees require nontrivial 
constructions to satisfy availability and stabilization. Between the trivial case of 
a variable and more advanced data structures, we consider first in this section 
two elementary data structures, a queue and a type of stack, to illustrate some 
challenges of implementing availability and stabilization. 



3.1 Queues, Stacks, and Conservative Implementations 

Two elementary examples of available and stabilizing data structures are de- 
scribed here, a queue and a type of stack. We then show two variations on the 
queue that do not have available and stabilizing implementations. These vari- 
ations on the queue are somewhat artificial, but they are useful to illustrate 
difficulties that arise later in the presentation of a composite data structure. An 
important concept introduced in this section is a conservative implementation 
of a data structure. Informally, an implementation is conservative if operations 
cannot substitute values for missing data. 

A K-queue is a queue with a capacity of K items supporting enqueue and
dequeue operations. The response to an enqueue(x) operation is either ack or
full, and the response to a dequeue is either an item or empty. Only the full
response is classified as an unsuccessful response. The content $C_t$ at any point t
can be defined from the sequence of successful enqueue operations up to point
t and the sequence of dequeue operations that return items up to point t (we
omit the formal definition). In any legitimate history, the running time of an
operation is O(1). In any well-formed history, responses have the following
properties. An enqueue operation at point t responds full iff $|C_t| \geq K$, and otherwise
responds ack; a dequeue at point t responds empty iff $|C_t| = 0$ and otherwise
responds with the initial item of sequence $C_t$.

The conventional implementation of a bounded queue by a circular array
is an available and stabilizing K-queue. Let A[0..(K - 1)] be an array of K
items and let head and tail be two integer variables. The K-queue is empty if
(head mod K) = (tail mod K), and full if ((tail + 1) mod K) = (head mod
K). An enqueue(x) succeeds iff the K-queue is non-full, in which case it writes x to
A[(tail + 1) mod K] and then assigns tail ← (tail + 1) mod K (the capacity of this
queue representation is thus K - 1 items). The dequeue is similarly defined, with
the result that the state of the K-queue is well-defined for all possible values of
head, tail, and A. By straightforward expansion of the definitions, availability
and stabilization hold for this implementation and the stabilization time is zero.
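
As an illustration of why every state of the circular array is meaningful, the
sketch below implements the queue in Python. The class name CircularQueue is
hypothetical, and the dequeue shown here (advance head, then read A[head]) is
our assumption about what "similarly defined" means; the text does not spell
it out.

```python
class CircularQueue:
    """Circular-array queue whose state is well-defined for any head, tail, A."""

    def __init__(self, K, head=0, tail=0, A=None):
        self.K = K
        # A, head and tail may hold arbitrary values after a transient fault.
        self.A = list(A) if A is not None else [None] * K
        self.head = head
        self.tail = tail

    def is_empty(self):
        return self.head % self.K == self.tail % self.K

    def is_full(self):
        return (self.tail + 1) % self.K == self.head % self.K

    def enqueue(self, x):
        if self.is_full():
            return "full"          # unsuccessful, hence moot
        self.A[(self.tail + 1) % self.K] = x
        self.tail = (self.tail + 1) % self.K
        return "ack"

    def dequeue(self):
        if self.is_empty():
            return "empty"
        # Assumed mirror image of enqueue: advance head, then read that slot.
        self.head = (self.head + 1) % self.K
        return self.A[self.head]
```

Because head and tail are only ever interpreted modulo K, a queue created
with, say, head = -7 and tail = 1003 simply denotes some ordinary queue
content, which is consistent with a stabilization time of zero.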

The composite data structure described in Section 5 makes use of a relative
of the K-queue. A lossy K-stack is a stack with a capacity of K items and push
and pop operations. A push operation for the lossy K-stack always succeeds and
a pop operation always returns an item. The lossy K-stack can be implemented by
an array of K elements and a top variable to contain an index for the top of
the stack, which is incremented modulo K by push and decremented modulo K
by pop. This stack is “lossy” because, for K + 1 consecutive push operations,
the last push writes over the oldest item on the stack. We omit the
straightforward details of the formal definition of this data structure and verification of its
availability and stabilization properties.
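
A minimal Python sketch of the lossy K-stack follows. The class name
LossyStack and the choice to increment top before writing are assumptions made
for the example; the text only fixes that top moves modulo K.

```python
class LossyStack:
    """Lossy K-stack: push always succeeds and pop always returns an item."""

    def __init__(self, K, top=0, A=None):
        self.K = K
        self.A = list(A) if A is not None else [None] * K
        self.top = top  # arbitrary integers are tolerated; used modulo K

    def push(self, x):
        # Advance top modulo K, then overwrite whatever occupies that slot.
        self.top = (self.top + 1) % self.K
        self.A[self.top] = x
        return "ack"

    def pop(self):
        # Read the slot under top, then retreat modulo K; never reports empty.
        x = self.A[self.top % self.K]
        self.top = (self.top - 1) % self.K
        return x
```

With K = 3, pushing 1, 2, 3, 4 overwrites the slot that held 1, so the next
three pops return 4, 3, 2 and the oldest item is lost.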

The elementary examples of queue and stack become more challenging when 
there are domain restrictions on the items contained in the data structure. Con- 
sider a queue that may only contain (pointers to) prime numbers as items. The 
enqueue operation can return a new response err: the response to enqueue(a;) is 
err for any nonprime x, and the operation is moot. The running time of enqueue 
is no longer constant, because enqueue tests its argument for primality. The 
dequeue operation has constant running time, either returning (a pointer to) an 
item or the empty response. Now suppose we have an implementation of this data 
structure and let σ be the state for some queue of prime numbers. It is possible
for a transient fault to transform σ into some σ' by changing some (pointers to)
items in the queue from prime to nonprime values. This is a problem because the 
dequeue operation can now return a nonprime number. Returning a nonprime 
number is not possible in a legitimate history and because an item in response to 
dequeue is a successful operation, there is a conflict with availability if dequeue 
is allowed to return a nonprime number. The only resolution is for dequeue to 
check an item for primality before responding. We do not know of any method 
to test primality in constant time, so dequeue will require nonconstant running 
time for faulty states such as σ'. If the implementation is also stabilizing, then
every history has a suffix where all dequeue operations have constant running 
time. Yet it is impossible for dequeue to distinguish between correct and faulty 
states in constant time unless there is a constant-time primality test. Therefore, 
every dequeue operation on a nonempty queue checks for primality, which con- 
flicts with the running time constraint for a stabilizing implementation. We have 
thus sketched the proof of the following. 

Lemma 1. No implementation of the prime queue is available and stabilizing. 

Another difficulty with a domain-restricted queue is shown by a queue that
may only contain items of the form a:b with a an even number. Again, execution of
enqueue(x:y) responds with err for odd x, but this can be checked in constant
time. The dequeue operation can also verify that the head item has an even
first component in constant time, so all operations run in constant time for a
legitimate history. After a transient fault, there could be queue items with odd
first components, so the question arises, what should dequeue do if it detects
that the head is odd? One choice would be to respond with 0:z in place of any
item r:z with odd r found at the head of the queue. This choice would satisfy
availability, because there is a legitimate history where zero is enqueued in place
of any odd value (notice this method cannot be used for the prime number queue
because of running time constraints). If there are further domain restrictions on
items, say a relation between a and b in a pair a:b, or perhaps relationships with
other variables outside the queue, then the substitution of 0:z for r:z would not
be valid.
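
The substitution strategy just described can be sketched as follows. This is
only an illustration: a Python deque stands in for the queue representation,
and the function names enqueue_checked and dequeue_coercing are hypothetical.

```python
from collections import deque


def enqueue_checked(queue, item, K):
    """Enqueue a pair (a, b); reject an odd first component in constant time."""
    a, _b = item
    if a % 2 != 0:
        return "err"            # moot: the item violates the domain restriction
    if len(queue) >= K:
        return "full"
    queue.append(item)
    return "ack"


def dequeue_coercing(queue):
    """Dequeue the head; substitute 0:z for a corrupted head r:z with odd r."""
    if not queue:
        return "empty"
    r, z = queue.popleft()
    return (r, z) if r % 2 == 0 else (0, z)
```

The coerced response 0:z is an item that was never enqueued, which is exactly
the kind of invented value that stops being acceptable once items must
satisfy further constraints.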

We call an implementation conservative if no operation returning a data 
structure item is allowed to invent the item — a conservative implementation 
can only return items that are present in the data structure, and cannot coerce 
invalid values to legal ones. This is a key decision in our presentation, and differs 
from classical work on self-stabilizing control structures. For classical problems 
such as mutual exclusion, it does not matter how illegitimate states are converted 
into legitimate ones, but for the stabilization of data structures, we prefer to limit 
techniques to conservative measures where data items are not created to satisfy 
a domain constraint. The motivation for conservative operations is not just to 
make a theoretical problem nontrivial: in practice, stabilizing algorithms that 
limit the effect of faults are preferable to those that do not enforce any such 
limit [?]. While it is true that a transient fault could inject apparently
legal values never actually inserted in a data structure, such transient faults are
uncontrolled, whereas operation implementation can be designed to avoid the in- 
jection of artificial values (moreover, practical techniques such as error detecting 
codes can decrease the probability of legal values injected by faults). 

Lemma 2. No conservative implementation of the x:y queue is available and 
stabilizing. 

The lemma can be shown by an adversary argument: each dequeue has to
distinguish, in O(1) time, whether the queue is empty or has an item meeting
the domain constraint, and for whatever strategy dequeue uses to examine the 
queue, an initial state can be constructed to defeat that strategy. 



3.2 Heap and Search Tree Review 

The heap and search tree data structures are more complicated than the queue,
and items in the heap and search tree have domain restrictions, unlike a simple
queue. In the heap, an item’s value is the least of a subtree of values, and in
a search tree, item keys are ordered. There is, however, a crucial way in which
these domain restrictions are simpler than the example of the prime number
queue: there is an implementation so that all operations access items via a path
from the tree’s root. This fact generalizes to the observation that, for any
values of items in such a tree, there is a maximal subtree for which the domain
restrictions hold. For the heap this is called the active heap and for the search
tree this is called the active tree.

The available and stabilizing search tree given in [?] is a 2-3 tree. Its active
tree is defined as the maximal subtree so that the distance from root to leaf is
the same for all leaves, all keys are in order, and several other validity conditions
hold for child and parent pointers. For the available and stabilizing heap [?], the
active heap is taken to be the maximal rooted subtree so that each node’s value is
a lower bound on the values of its children. The table below shows the signatures
and responses for the 2-3 tree and heap structures. The method to ensure
availability is straightforward: all data structure operations are implemented with
respect to the active structure. A consequence of this method is that the active
structure can be unbalanced, so that operations on the active tree may
no longer have running time that is logarithmic in the number of active items. The
remaining implementation task is stabilization, which entails balancing the
active structure. Informally, the balancing of the active structure is a “background
activity” similar to garbage collection in memory allocation schemes. This
background activity is also needed to repair pointers, repair free storage chains, and
correct various other internal variables of the data structure. Because no actual
background process is assumed by the model of data structures, each operation
on the data structure invokes a limited amount of background processing. The
running time of any call to background processing is O(lg K) in an
arbitrary state and O(lg n) after the stabilization point, where K is the capacity of
the data structure and n is the number of items in the active structure. These
running times are the same as the operation complexities, since in the worst
case, an active heap or search tree encounters a path of length O(lg K), and
when the structure is balanced, all paths are O(lg n).



signature        successful         unsuccessful   illegitimate   legitimate
STinsert(k, d)   ack                full           O(lg K)        O(lg n)
STdelete(k)      ack                -              O(lg K)        O(lg n)
STfind(k)        (k, d) / missing   -              O(lg K)        O(lg n)
Hinsert(v, e)    ack                full           O(lg K)        O(lg n)
Hdeletemin( )    (v, e) / empty     -              O(lg K)        O(lg n)
Hdelete(p)       ack                -              O(lg K)        O(lg n)
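
To make the notion of an active heap concrete, the following Python sketch
computes it for a heap stored as an implicit binary tree in an array. The
array layout (children of index i at 2i + 1 and 2i + 2) and the function name
active_heap are assumptions for illustration; the heap in the cited work also
maintains pointers and other variables that are not modeled here.

```python
def active_heap(vals):
    """Return the indices of the maximal rooted subtree obeying the heap order.

    vals is an array holding arbitrary (possibly corrupted) values. A child
    joins the active heap only if its parent is active and the parent's value
    is a lower bound on the child's value.
    """
    if not vals:
        return []
    active = [0]
    frontier = [0]
    while frontier:
        i = frontier.pop()
        for c in (2 * i + 1, 2 * i + 2):
            if c < len(vals) and vals[c] >= vals[i]:
                active.append(c)
                frontier.append(c)
    return sorted(active)
```

For vals = [1, 5, 0, 6, 2, 9, 9] the active heap consists of indices 0, 1
and 3: index 2 violates the order with the root, so its descendants 5 and 6
are cut off even though their values look plausible.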



4 Composite Data Structure 

The composite data structure presented here is called the heap-search tree.
Definitions of the heap-search operations are explained in this section; the
construction of the heap-search tree is presented in Section 5. The table below presents
our initial table of the signatures and responses of the operations; later in this
section we revise this table, after presenting an impossibility result for a
conservative implementation.

signature        successful         unsuccessful   illegitimate   legitimate
insert(k, v)     ack                full           O(lg K)        O(lg n)
delete(k)        ack                -              O(lg K)        O(lg n)
find(k)          (k, v) / missing   -              O(lg K)        O(lg n)
deletemin( )     (k, v) / empty     -              O(lg K)        O(lg n)


Each item of the heap-search tree is a pair (k, v), and the content of the
heap-search tree at any point is a multiset (bag) of such pairs. Below we use
union ($\cup$) and subtraction ($\setminus$) for multiset operations, so $C \setminus \{(k, v)\}$ removes at
most one copy of (k, v) from multiset C. Let s and t be consecutive points in a
legitimate history and let the operation and response occur between s and t.
Operation insert(k, v) is moot and responds full if $|C_s| \geq K$; otherwise $|C_s| < K$
and insert(k, v) responds ack, with $C_t = C_s \cup \{(k, v)\}$. Operation delete(k)
responds with ack; the operation is moot if there exists no v satisfying $(k, v) \in C_s$,
otherwise $C_t = C_s \setminus \{(k, v)\}$ for some v satisfying $(k, v) \in C_s$. Operation find(k)
is always moot, and either returns a pair (k, v) for some v satisfying $(k, v) \in C_s$,
or returns missing if no such pair exists. Operation deletemin( ) returns empty
and is moot if $|C_s| = 0$, otherwise the response is a pair (k, v) such that v is the
minimal value for any pair’s second component, and $C_t = C_s \setminus \{(k, v)\}$ in this
case.
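
The legitimate-history semantics above can be captured by a small reference
model over a multiset. The Python sketch below is a specification aid only:
the class name HeapSearchSpec is hypothetical, and its operations run in
linear time rather than the logarithmic time required of the real structure.

```python
from collections import Counter


class HeapSearchSpec:
    """Reference model: the content is a multiset of (key, value) pairs."""

    def __init__(self, K):
        self.K = K
        self.C = Counter()

    def _size(self):
        return sum(self.C.values())

    def _remove_one(self, pair):
        # Multiset subtraction: remove at most one copy of the pair.
        self.C[pair] -= 1
        if self.C[pair] == 0:
            del self.C[pair]

    def insert(self, k, v):
        if self._size() >= self.K:
            return "full"                      # moot
        self.C[(k, v)] += 1
        return "ack"

    def delete(self, k):
        pair = next((p for p in self.C if p[0] == k), None)
        if pair is not None:
            self._remove_one(pair)
        return "ack"                           # ack even when moot

    def find(self, k):
        return next((p for p in self.C if p[0] == k), "missing")

    def deletemin(self):
        if not self.C:
            return "empty"
        pair = min(self.C, key=lambda p: p[1])  # minimal second component
        self._remove_one(pair)
        return pair
```

Note that delete answers ack whether or not a matching pair exists, so in
this initial specification a caller cannot tell a moot delete from an
effective one.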

Before making statements about heap-search tree implementations, we first
formalize what it means for an implementation of this data structure to be a
composite of the heap and 2-3 tree. Informally, the implementation is a
composite if it is a construction made by assembling one heap and one 2-3 tree, so that
any pair $(k, v) \in C_s$ is represented by the item k in the 2-3 tree at point s and
v in the heap at point s, and no other data structures are used in the storage
of items. More formally, the implementation of the heap-search tree satisfies the
following three constraints: (i) the composite heap-search tree implementation
has one 2-3 tree S, one heap T, and possibly other data structures used for
background processing; (ii) at any point s in a history, $(k, v) \in C_s$ iff $(k, d) \in S_s$ and
$(v, e) \in T_s$, where $S_s$ and $T_s$ respectively denote the contents of the 2-3 tree and
heap at point s, with d and e being associated data as defined by operations;
and (iii) for any $(k, v) \in C_s$, the key k is contained only in the 2-3 tree (keys
are not contained in any structure other than the 2-3 tree); similarly, the value
v is not contained in any other structure than the heap. Note that (iii) could
be ambiguous for a structure with integer keys as well as integer heap values,
because a key could coincidentally reproduce a heap value. To resolve this
ambiguity, assume that keys and heap values are taken from different types (say key
and value) for purposes of defining constraint (iii).

Lemma 3. No conservative implementation of the composite heap-search tree 
using the heap and 2-3 tree is available and stabilizing. 



The conclusion we draw from Lemma 3 is that either we should settle for an
implementation that is not conservative, or the operations should be modified.
The proof of Lemma 3 (omitted from this extended abstract) points out that
deletemin is responsible for the difficulty, so we propose the following change:
let deletemin have a new response err, to be returned if the value returned
by Hdeletemin does not have a corresponding key in the 2-3 tree. This change
satisfies availability by classifying any deletemin with an err response as an
unsuccessful operation. However, to satisfy stabilization, we shall require that no
deletemin be unsuccessful after the stabilization point, since the err response
does not occur in a legitimate history. For application purposes, err is useful
because it informs the deletemin caller that something is incorrect in the
heap-search data structure, but by returning within O(lg K) time, quickly gives the
application the choice of repeating deletemin (and progressing toward
stabilization) or using some other type of recovery. Lemma 3 can also be proved using the
find operation instead of deletemin, because the search tree may contain
duplicate keys in an initial state; the delete has another similar difficulty. Therefore,
for the remainder of the paper, let delete, find, and deletemin return err if
the operation is unsuccessful.
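
The role of the err response can be illustrated with a toy model. The sketch
below is not the construction of Section 5: it assumes, purely for
illustration, that the heap is a Python heapq list of (value, key) pairs and
that the 2-3 tree is a dictionary from keys to values.

```python
import heapq


def deletemin(heap, tree):
    """Toy composite deletemin that reports err for an orphaned heap value.

    heap: heapq-ordered list of (value, key) pairs (stand-in for the heap).
    tree: dict mapping key -> value (stand-in for the 2-3 tree).
    """
    if not heap:
        return "empty"
    v, k = heapq.heappop(heap)
    if tree.get(k) != v:
        # The heap value has no matching key in the tree. The pair was never
        # part of the composite content, so discarding it is moot, and the
        # caller may simply retry.
        return "err"
    del tree[k]
    return (k, v)
```

Each err response discards one orphaned heap entry, which is one way to see
how repeating deletemin makes progress toward a legitimate state in this toy
model.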

5 Heap and 2-3 Tree Composite Construction 

5.1 Variables, Constituent Structures, and Pointers 

The composite data structure is composed of two binary variables STbit and
Mbit, a variable curcolor with domain {0,1,2}, a lossy K-stack, a heap, and
a 2-3 tree. As explained earlier, all of these constituents (variables and
structures) are available and stabilizing components. Throughout the remainder of
the paper, K is the capacity of the heap-search tree and also the capacity of
the heap and 2-3 tree. For convenience, we use the term nextcolor to mean
(1 + curcolor) mod 3 and the term prevcolor to mean (2 + curcolor) mod 3.

Our construction uses pointers to connect heap items and 2-3 tree items. The
use of pointers in conventional random access memory is challenging because
damaged pointers can lead to further damage in data structures (for instance,
modifying data accessed by a damaged pointer). We make some assumptions
concerning how data, especially the heap and 2-3 tree, are arranged in memory.
We assume that a procedure Hpointer(p) evaluates p and in O(1) time returns
true if p could be an item of the heap, and false otherwise. An implementation of
Hpointer(p) could check that p is an address within a range [Hstart, Hend]
defined for heap items, and also check that p is a properly aligned address
(Hstart and Hend are program constants not subject to transient fault damage).
Similarly, we assume there is a procedure STpointer(p) to determine whether p
can be a 2-3 tree item.
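
As an illustration only, the range-and-alignment test suggested for Hpointer(p) might be sketched in Python as follows. The constants H_START, H_END and ITEM_SIZE are hypothetical values chosen for the example; the paper fixes only that such bounds exist and are immune to transient faults.

```python
# Hypothetical layout constants (assumptions for this sketch only).
H_START = 0x1000   # address of the first heap item slot
H_END = 0x2000     # address of the last heap item slot (inclusive)
ITEM_SIZE = 16     # size in bytes of one heap item slot


def hpointer(p: int) -> bool:
    """Return True if p could be the address of a heap item; runs in O(1)."""
    in_range = H_START <= p <= H_END
    aligned = (p - H_START) % ITEM_SIZE == 0
    return in_range and aligned
```

A true result says only that p is a candidate address; it says nothing about whether the item is part of the active heap.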

The next level of pointer checking is to determine whether or not a given p
is a pointer to an item in the active structure. Let STintree(p) respond true if
p is a pointer to an item in the active 2-3 tree, and false otherwise. The
running time of STintree(p) is O(h), where h is the height of the active 2-3
tree. The implementation of STintree(p) could be similar to one described in |0|;
after using STpointer(p) to validate candidacy of p, the procedure follows
parent pointers of tree items to verify that the tree’s root is an ancestor, and
also verifies that the keys in those items are such that all items in the path
are in the active 2-3 tree. A similar procedure Hinheap(p) determines whether p
is an item of the active heap. Hinheap(p) can be implemented by tracing the
parentage of p back to the heap’s root. Unlike STintree(p), which consumes O(h)
time, the complexity of following p’s parentage for arbitrary p satisfying
Hpointer(p) is O(lg K); the running time is O(h) if p is an item of the active
heap. In some cases, the time bound for Hinheap(p) should be constrained, even
if the result is inaccurate. Let Hinheap(p, t) be an implementation that limits
the parentage trace to t iterations, and if t iterations do not suffice to reach
the heap’s root, Hinheap(p, t) returns false.
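
A minimal sketch of the bounded trace Hinheap(p, t) is given below, assuming (for illustration only) that parent links are held in a dictionary PARENT and that the root address ROOT is known. These names and the sample data are hypothetical and do not come from the paper.

```python
# Hypothetical memory image: item address -> parent address.
# Addresses 7 and 8 form a cycle, as might be left behind by a transient fault.
PARENT = {1: 0, 2: 0, 3: 1, 4: 1, 7: 8, 8: 7}
ROOT = 0


def could_be_heap_item(addr) -> bool:
    """Stand-in for Hpointer: the address is one that this sketch knows about."""
    return addr == ROOT or addr in PARENT


def hinheap(p, t) -> bool:
    """Follow at most t parent links from p; True only if the root is reached."""
    cur, steps = p, 0
    while could_be_heap_item(cur):
        if cur == ROOT:
            return True
        if steps == t:
            return False          # bound exhausted: answer false, even if inexact
        cur = PARENT.get(cur)
        steps += 1
    return False
```

The bound t is what keeps the running time predictable when damaged parent links form long chains or cycles.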

Each item of the K-lossy stack is a pointer. Items of the heap and 2-3 tree
provide for a data field in the respective insert operations (see (k, d) and
(v, e) in the table defining operations), and we use these data fields for
pointers. For an item (k, d) of the 2-3 tree, d is a pointer. For an item (v, e)
of the heap, e = (q, c) with q being a pointer and c being a “color” in the
range {0,1,2}. Our convention is to refer to d as the pointer associated with
the 2-3 tree item (k, d), to call q the pointer associated with heap item (v, e),
and to call c the color associated with (v, e). We also use the notation x.color
to refer to the color of an item x.

At any point in a legitimate history, the pointers associated with items in the
active heap and active 2-3 tree should bind a pair (k, v), which means that the
pointer d associated with k in the 2-3 tree refers to an item with value v in the
heap, and in turn the pointer q associated with v refers to the item (k, d). Let
STcross(p), for p a pointer to a 2-3 tree item (k, d), be a boolean function
returning true only if Hpointer(d) is true, there is a pointer q associated with
the heap item of d, and q = p. A similar function Hcross is defined for heap
items, and we generically use the term crosscheck relation to mean that the
pointer associated with an item, heap or 2-3 tree, refers to an item in the other
structure that has the expected back pointer; we say that two items crosscheck
if they satisfy the crosscheck relation. Note that the crosscheck relation can be
checked in O(1) time, but satisfying the crosscheck relation does not imply
membership in the active heap or 2-3 tree.
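
The crosscheck relation can be sketched as follows, assuming (purely for illustration) that memory is modelled by two dictionaries keyed by address: TREE_MEM maps an address to a tree item (k, d) and HEAP_MEM maps an address to a heap item (v, (q, c)). The dictionaries and their contents are hypothetical.

```python
# Hypothetical memory images for the two constituent structures.
TREE_MEM = {100: ("a", 200), 101: ("b", 201), 102: ("c", 999)}
HEAP_MEM = {200: (5, (100, 0)), 201: (7, (100, 0))}


def st_cross(p) -> bool:
    """True only if tree item at p points to a heap item that points back to p."""
    if p not in TREE_MEM:
        return False
    _k, d = TREE_MEM[p]
    if d not in HEAP_MEM:            # stands in for the Hpointer(d) test
        return False
    _v, (q, _c) = HEAP_MEM[d]
    return q == p


def h_cross(a) -> bool:
    """True only if heap item at a points to a tree item that points back to a."""
    if a not in HEAP_MEM:
        return False
    _v, (q, _c) = HEAP_MEM[a]
    if q not in TREE_MEM:            # stands in for the STpointer(q) test
        return False
    _k, d = TREE_MEM[q]
    return d == a
```

Both checks take constant time, and neither inspects whether the items are reachable from the root of the active heap or active 2-3 tree.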

5.2 Modifications to Constituent Operations 

We change the operations of the heap and 2-3 tree only by adding some extra
steps to look after the pointers and the color field associated with a heap item. 
The reason that pointers are an issue is that items can change location inside 
a data structure as a result of insertions or deletions. For the 2-3 tree, this can 
occur by node splitting or merging. For the heap, this occurs by item swaps as 
part of “heapify” routines to restore the heap property after an item is removed 
or added. 

The change relating to pointers is: whenever an operation moves an item x,
the operation validates x by a crosscheck, and if the crosscheck holds, then the
operation adjusts pointers so that the crosscheck relation will hold after the
item moves. Conversely, if the crosscheck relation does not hold for x prior to
the move, then the pointer is forcibly invalidated so that crosscheck will not
hold (accidentally) as a result of moving x. This change applies to both heap
and 2-3 tree operations, since procedures of both can move their respective
items within the active structures. Background activities that move items are
also changed to attend to item movement. In particular, the background heap
activity includes a balance routine that deletes an item of maximum depth,
reinserting it at minimum possible depth; the movement of this item by balance
requires pointer adjustment so that subsequent crosschecks are valid.

The change relating to colors only applies to heap operations and background
activities. Whenever an operation or background routine examines an active heap
item with color c, it pushes a pointer to that item on the K-lossy stack if
c = nextcolor, and then assigns the item’s color to be the value of curcolor.
A single operation or background routine for the heap may examine many nodes
(for instance, examining all the nodes along a path from root to a leaf), so an
operation can push many pointers on the stack as a result of this change.
However, since checking the color field and stack pushes are O(1) time steps,
the operation complexities of the heap are unchanged.
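
The color step can be sketched as below. Two assumptions are made for the example and are not taken from this section: the K-lossy stack is modelled as a bounded deque that silently drops its oldest entry when full, and item colors are kept in a dictionary keyed by (hypothetical) item address.

```python
from collections import deque

K = 4                                  # hypothetical capacity
lossy_stack = deque(maxlen=K)          # assumption: the oldest pointer is lost on overflow


def nextcolor(curcolor: int) -> int:
    return (1 + curcolor) % 3


def prevcolor(curcolor: int) -> int:
    return (2 + curcolor) % 3


def examine(addr, colors, curcolor, stack) -> None:
    """Color step applied whenever an active heap item is examined."""
    if colors[addr] == nextcolor(curcolor):
        stack.append(addr)             # push a pointer to the stale item
    colors[addr] = curcolor            # then refresh its color


# Example: with curcolor = 0, only the item colored 1 (= nextcolor) is pushed.
colors = {10: 1, 11: 0, 12: 2}
for address in sorted(colors):
    examine(address, colors, 0, lossy_stack)
```

Each call does a constant amount of work, which is why the heap's operation complexities are unaffected.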

Two background activities of the constituent structures have new duties in 
the composite data structure. First, we review terminology for the existing back- 
ground operations. The heap has a background operation called Hscan and the 
2-3 tree has a similar operation called STscan. One intent of these background 
operations is the same for both structures, which is to examine a path of nodes 
from root to leaf, possibly truncating nodes that are not part of the active struc- 
ture, correcting variables within nodes, and in the case of the 2-3 tree, possibly 
merging nodes. One Hscan invocation can examine up to lg K items, since each
node in the active heap contains an item. One STscan invocation examines two 
items by looking at two paths from root to leaf (only the leaves of the active 
2-3 tree contain items). In any sequence of operations on a structure the paths 
chosen by Hscan or STscan advance through the tree in a standard “left to right” 
order. The path chosen in an initial state is unpredictable, but after examining 
the rightmost path, the next invocation returns to the leftmost path, so that all 
nodes will be examined in any sequence of (sufficiently many) operations. 

For the composite data structure, we add a step to Hscan on only one occasion,
which is immediately after the rightmost path of the heap is examined: the
assignment Hbit ← f_H(Hbit, STbit) is executed, for

    f_H(x, y) = \begin{cases} x & \text{if } x = y \\ 1 - x & \text{if } x \neq y \end{cases}

If Hbit = 1 ∧ STbit = 1 as a result of this assignment, then the assignment
curcolor ← nextcolor is executed.

One change to STscan resembles the change to Hscan: immediately after
STscan examines the rightmost path of the 2-3 tree, the assignment STbit ←
f_ST(Hbit, STbit) is executed, where

    f_{ST}(x, y) = \begin{cases} y & \text{if } x \neq y \\ 1 - y & \text{if } x = y \end{cases}

The remaining change to STscan adds extra work in the examination of a 2-3
tree item. When STscan examines an item (k, d), the procedure crosschecks
(k, d); if the crosscheck indicates that k does not have a corresponding heap
value, then STscan removes (k, d) from the active 2-3 tree. If the crosscheck
test passes, STscan then evaluates Hinheap(d), and removes (k, d) from the
active 2-3 tree if Hinheap(d) is false. Finally, if (k, d) passes the crosscheck
and Hinheap tests, then STscan assigns c ← curcolor, where c is the color
variable of the heap item associated with k.
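
The interplay of the two bit updates can be sketched as follows. The sketch reads "as a result of this assignment" as "the assignment changed Hbit"; that reading is an assumption of the example, as are the class and method names.

```python
def f_h(x: int, y: int) -> int:
    """Hscan's update: keep x when the bits agree, flip it when they differ."""
    return x if x == y else 1 - x


def f_st(x: int, y: int) -> int:
    """STscan's update: keep y when the bits differ, flip it when they agree."""
    return y if x != y else 1 - y


class PhaseState:
    def __init__(self, hbit=0, stbit=0, curcolor=0):
        self.hbit, self.stbit, self.curcolor = hbit, stbit, curcolor

    def after_hscan_rightmost(self):
        old = self.hbit
        self.hbit = f_h(self.hbit, self.stbit)
        # Assumption: advance the color only when this assignment produced 1, 1.
        if self.hbit != old and self.hbit == 1 and self.stbit == 1:
            self.curcolor = (1 + self.curcolor) % 3

    def after_stscan_rightmost(self):
        self.stbit = f_st(self.hbit, self.stbit)


# Alternate complete Hscan and STscan sweeps and record the state after each pair.
state = PhaseState()
trace = []
for _ in range(4):
    state.after_hscan_rightmost()
    state.after_stscan_rightmost()
    trace.append((state.hbit, state.stbit, state.curcolor))
```

In this alternating schedule Hscan makes the bits equal and STscan makes them differ, so curcolor advances once for every two complete sweeps of both structures.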

5.3 New Background Activities 

Each invocation of a heap operation contains a call to Hscan. Each invocation of
a 2-3 tree operation contains a call to STscan. The modifications of these
background activities described in Section 5.2 supply most of the effort needed
to stabilize the composite structure; only one additional, new procedure is
needed. We call this new routine trimheap. Procedure trimheap is called once in
the execution of any of the composite operations, and consists of the steps
shown in Figure 1. By reasoning about the running times of Hpointer, crosscheck,
STintree, Hinheap, and Hdelete, it follows that any call to trimheap requires at
most O(lg K) time at an arbitrary state and O(lg m) time at a legitimate state
for a heap-search tree containing m items (at a legitimate state, the heap’s
root has an accurate height field, which then is used to limit the running time
of Hinheap).

We also suppose that each of the composite operations includes calls to STscan
and Hscan. Such calls will already be included whenever an operation invokes one
of the appropriate constituent operations; however, not all composite operations
invoke constituent operations on both heap and 2-3 tree structures. The find
operation, for instance, invokes STfind but does not include any heap operation;
and although deletemin invokes both Hdeletemin and STdelete in a legitimate
history, it may not do so before the stabilization point, as we show below.
Therefore we suppose that trimheap includes calls to Hscan and STscan
as needed, to ensure these constituent background activities execute with each
operation in any history.

5.4 Operations of the Composite 

The four operations of the heap-search tree have pseudocode listed in Figure 1.
We suppose in this figure that if STfind(k) returns an item (a, b), then a pointer
to item (a, b) in the 2-3 tree component is available or can be obtained in O(1)
time. Also, for the sake of brevity, we do not provide details for how to deal
with duplicate key values; we assume that an STdelete(k) will remove the item
returned by an STfind(k) immediately preceding, and suppose similar behavior for
insertions.

The usage of crosscheck in the code for deletemin bends our earlier explanation
of checking item pointers. This crosscheck(p) should check that the 2-3
tree item referred to by p is an item (k, d) such that d points to the root of the
heap, since the heap item t returned by Hdeletemin was located at the root
of the heap prior to being removed.

trimheap
    q ← pop( )
    if (¬Hpointer(q) ∨ ¬crosscheck(q)) return
    p ← search tree item for q
    if ¬STintree(p) return
    t ← height field of heap’s root
    if Hinheap(q, t) then Hdelete(q)

find(k)
    if (STfind(k) = missing)
        return missing
    (a, b) ← STfind(k)            // a = k
    p ← address of item (a, b)
    if (STcross(p) ∧ Hinheap(b))
        (v, e) ← heap item via pointer b
        return (k, v)
    STdelete(k)                   // delete invalid key
    return err

delete(k)
    if (STfind(k) = missing) return ack
    (a, b) ← STfind(k)            // a = k
    p ← address of item (a, b)
    if (STcross(p) ∧ Hinheap(b))
        Hdelete(b)
        STdelete(k)
        return ack
    STdelete(k)                   // delete invalid key
    return err

deletemin( )
    t ← Hdeletemin( )
    if (t = empty) return empty
    (v, (p, c)) ← t
    if (crosscheck(p) ∧ STintree(p))
        (k, d) ← 2-3 tree item via pointer p
        STdelete(k)               // delete (k, d)
        return (k, v)
    return err

insert(k, v)
    d ← temporary value
    if (STinsert(k, d) = full)
        return full
    // locate ST item just inserted
    (a, b) ← STfind(k)
    t ← address of item (a, b)
    e ← (t, curcolor)
    if (Hinsert(v, e) = full)
        STdelete(k)               // backout
        return full
    replace b within 2-3 item by
        b ← address of (v, e) in heap
    return ack

Fig. 1. trimheap procedure and composite data structure operations.
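
To make the role of the err response concrete, the following toy model (an illustration, not the construction of this paper) replaces the pointer machinery with Python's heapq and a dictionary: heap entries are (value, key) pairs, the dictionary maps key to value, and the test tree.get(k) == v stands in for crosscheck(p) ∧ STintree(p). All names are hypothetical.

```python
import heapq


def deletemin(heap, tree):
    """Toy composite deletemin: returns (k, v), 'empty', or 'err'."""
    if not heap:
        return "empty"
    v, k = heapq.heappop(heap)        # plays the role of Hdeletemin
    if tree.get(k) == v:              # stand-in for crosscheck and STintree
        del tree[k]                   # plays the role of STdelete(k)
        return (k, v)
    return "err"                      # heap value had no corresponding key


# An "initial state" in which the heap holds a value (2, 'zz') unknown to the tree.
heap = [(1, "a"), (2, "zz"), (3, "b")]
heapq.heapify(heap)
tree = {"a": 1, "b": 3}
results = [deletemin(heap, tree) for _ in range(4)]
```

In this run the second call returns err, and repeating deletemin makes progress: the stray heap entry is gone and later calls answer normally.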

5.5 Verification 

For a given point t in a history of operations on the heap-search tree, let H_t
be the bag of items in the active heap and let ST_t be the bag of items in
the active search tree. Let A_t denote the active multiset at point t, defined
by A_t = { (k, v) | k ∈ ST_t ∧ v ∈ H_t ∧ (k, v) satisfy the crosscheck relation }.
The expressions |H_t| and |ST_t| denote the number of items in the (respective)
active structures. A state of the composite data structure is ST-legitimate if
the search tree component of the state is legitimate as defined in [2(. A state
of the composite data structure is H-legitimate if the heap component of the
state is legitimate as defined in 0. A state is ST/H-legitimate if it is both
ST-legitimate and H-legitimate. A state is legitimate if (i) it is
ST/H-legitimate, (ii) for each item x of the active search tree, there exists an
item y of the active heap such that x and y are related by the crosscheck
relation, and (conversely) (iii) for each item a of the active heap, there exists
an item b of the active search tree such that a and b are crosscheck related.
The availability and closure properties, based on these definitions, have simpler
proofs of correctness than the proof of convergence. To show convergence, we
define a segment of a history to be a color phase if the curcolor value is the
same at each point in the segment. Define colorsafe to be the weakest predicate
closed under heap-search operations such that for any point s where colorsafe
holds, the state of the data structure is ST/H-legitimate and
(∀x : x ∈ H_s : x.color ≠ nextcolor).

Lemma 4. Let s be an ST/H-legitimate point in a history and let m_s =
max(|H_s|, |ST_s|). The color phase containing point s terminates at some point
w after at most 15 m_s operations and max(|H_w|, |ST_w|) < 16 m_s. Following any
point s at which ST/H-legitimacy holds and m_s = max(|H_s|, |ST_s|), there occurs
a point t within O(m_s) operations of any history such that the data structure at
point t is colorsafe and m_t = max(|H_t|, |ST_t|) = O(m_s).

For any point s in a history of operations on the composite data structure, let
depth_s(j) for j ∈ H_s be the number of pop operations required to obtain a
pointer to j from the lossy stack (depth_s(j) = ∞ if there is no pointer to item
j on the lossy stack). Let H̄_s denote the bag of active heap items for which the
crosscheck relation does not hold, that is, H̄_s contains those active heap items
that do not correspond to any active multiset item. Define stacksafe(m) to be the
weakest closed predicate such that at any point s where stacksafe(m) holds, the
state of the data structure is colorsafe and (∀j : j ∈ H̄_s : depth_s(j) < 3m).

Lemma 5. Following any colorsafe point s with m_s = max(|H_s|, |ST_s|), there
occurs a point t within O(m_s) operations of any history such that the data
structure at point t is stacksafe(m_s) and m_t = max(|H_t|, |ST_t|) = O(m_s).

Theorem 1. Within O(m_s) operations of any history the state of the composite
data structure is legitimate, where m_s = max(|H_s|, |ST_s|) for the initial
point s of the history.

6 Conclusion 

There are likely simpler ways to construct a stabilizing available structure with
the signatures of the composite presented in this paper (two search trees, an
augmented tree, etc.); however, our aim was to investigate the composition of a
data structure from given components. Although we present a construction, our
results concerning the larger question are somewhat negative: not everything is
achievable given constraints of (conservative) availability and stabilization (the
possibility of conflict between availability and fault tolerance was observed long
ago P). 

The reader may wonder whether the case of sequential data structure
operations, without distributed implementation or concurrency, merits any
investigation: after all, the very title of [2] is “self-stabilization in spite of
distributed control” (recent definitions of self-stabilization refer only to
behavior [1] and care not whether the system is distributed or sequential).
Here, the difficulties to
overcome have to do with two aspects of availability during convergence, namely 
the integrity of operation responses and the time bounds of operations. Both 
aspects have practical motivation: users of data structures and system designers 
of forward error recovery value the integrity of operation responses; and hav- 
ing guaranteed time bounds on operations is useful for the design of responsive 
applications in synchronous environments. It could be interesting to reconsider 
our constraint on the time bound, perhaps allowing some polynomial extra time 
during convergence. 

Abstract. We present a causal deterministic merge program for publish-subscribe
systems. Our program ensures that if two subscribers receive two messages then
they receive them in the same order. Also, it guarantees that the order in which
a subscriber receives messages is a linearization of the causal order among
those messages. To develop our program, we expect two guarantees from the
underlying system: the first guarantee deals with the difference between
physical clocks and the second guarantee deals with message delays. While
O(n^2) space is required for causal delivery of unicast messages in asynchronous
systems, our program only uses O(log n) space. We also show how our program can
be made stabilizing while using only bounded space. Moreover, the recovery time
for our program is proportional to the bounds guaranteed by the underlying
system.



1 Introduction 

We focus our attention on the problem of causal deterministic merge in a publish- 
subscribe network. A publish-subscribe network consists of a set of publishers 
that publish messages and a set of subscribers that subscribe to a subset of 
those messages. We assume that the subscribers wish to receive messages in a 
uniform total order that conforms to the causal order. The uniform total order 
requires that if two subscribers receive two messages, then they receive them in 
the same order. The causal order requires that if a subscriber receives messages 
m1 and m2 such that m2 causally depends on m1 then that subscriber delivers
m1 before delivering m2. The causal dependency may occur through published
messages or through internal communication among publishers. Such causal or- 
der delivery is desirable in several applications. For example, consider a scenario 
where a publisher publishes a question and another publisher publishes its an- 
swer. It is desirable that the subscriber receives the question before receiving the 
answer. 

A uniform total order that conforms to the causal order can be obtained 
by using a centralized solution. However, a centralized solution lacks scalability 
and reliability. To redress this, we use replicated mergers (as proposed in P).
Each merger is associated with a set of subscribers. All published messages are
sent to the (relevant) mergers. The mergers reorder the messages so as to
obtain a causal deterministic order. The deterministic order guarantees that if
two mergers receive overlapping messages, they order those overlapping messages
consistently. Moreover, the causal order guarantees that if m2 causally depends
on m1 then m1 will be ordered before m2. The reordered messages are then sent
to the subscribers.

We build these mergers by extracting two simple guarantees expected of the
underlying distributed system on which the publish-subscribe network is built.
The first guarantee is related to the clocks: we require that the system
provide a bound, say ε, such that the clock drift between different processes
(that implement the publishers, subscribers and mergers) in the system is
bounded by ε. This guarantee can be met by using GPS clocks, network time
protocol, atomic clocks or clock synchronization programs. The second guarantee
relates to the message delays: we require that messages that reach their
destination do so within some bound, say δ. This guarantee can be met by using
protocols that characterize messages as being timely or late.

Our solution uses bounded space and provides stabilizing fault-tolerance in
the presence of faults. It uses O(ε log n + log δ) space and recovers in O(ε + δ)
time from an arbitrary state (that could be reached in the presence of faults
such as message corruption, transients, temporary violation of system
guarantees). However, for a given system, the space used by our program is
bounded in that it does not grow as the computation progresses.

We decompose the problem of causal deterministic merge into two parts: (1) 
how to design logical timestamps that capture causality among messages, and (2) 
how to use the logical timestamps to obtain a uniform total order that conforms 
to the causal order. Specifically, in the first part, we develop bounded-space and 
stabilizing logical timestamps. In the second part, we use the logical timestamps 
to order messages in a uniform total order that conforms to the causal order. 

To develop our solution for logical timestamps, we begin with the following
observation: Had the underlying system provided a global clock (ε = 0), the
physical clock alone would have been sufficient to implement the logical
timestamps. We, therefore, develop logical timestamps that consist of the
physical clock value and some additional information that is proportional to ε
and δ. Also, the guarantees given by the distributed system are used to add
stabilizing tolerance; the resulting implementation, thus, ensures that even if
the logical timestamp values are perturbed, eventually they are restored so that
the causality is tracked correctly.

Contributions of the Paper. The contributions of this paper are as follows:
(1) We present a bounded and stabilizing solution to logical timestamps. The
space cost for our program is O(ε log n + log δ) and the recovery time for our
program is O(ε + δ). (2) Using the logical timestamps, we present a bounded
and stabilizing implementation of causal deterministic merge. If we concentrate
only on one merger, the causal deterministic merge program also serves as a
causal delivery program. The state space for our program grows logarithmically
in the number of processes. By contrast, the space cost for previous programs
(e.g., PIEI) is quadratic in nature. Our solution guarantees that if a message
arrives at its destination within δ time, it will be delivered within δ + 3ε.
Thus, the resulting system is one where the clocks differ by ε and the messages
arrive within time δ + 3ε or are lost. Note that these assumptions are similar
to the assumptions of the underlying system (with δ replaced by δ + 3ε). We find
that this observation simplifies the task of designing causal deterministic
merge in hierarchical systems. (3) We present extensions of our program where we
identify simple conditions to reduce the space complexity further. With these
conditions, the space requirement is independent of the number of processes.

Organization of the Paper. In Section 2, we formally specify our system
model, guarantees expected from the underlying system, and the types of faults.
Then, in Section 3, we present our solution to logical timestamps. We then use
that solution to develop our causal deterministic merge program. Next, we
discuss extensions of our program, and identify a simple condition under which
the message size could be made independent of the number of processes. After
that, we present the role of our model in each of these solutions. Finally, we
make concluding remarks and point out future directions. For reasons of space,
we refer the reader to ^ for proofs.

2 System Model 

A distributed system consists of a finite set of processes which communicate via 
message passing. A process is used to implement a publisher, subscriber or a 
merger. Each process j has access to a physical clock rt.j. We do not assume 
any relation between rt.j and the auxiliary global time, i.e., we do not assume 
that any process is aware of the ‘real’ time. 

We assume that in the absence of faults, the distributed system satisfies the 
following two guarantees, and in the presence of faults, these guarantees may be 
violated only temporarily. (We treat the above guarantees as assumptions that 
system users can make. Hence, depending upon the context, we use the word 
assumption instead of guarantee.) 



Guarantees of the Distributed System.

G1. The value of rt.j is non-decreasing, and at any time, the difference
between the clock values of any two processes is bounded by ε. In other
words,

    ∀j, k :: |rt.j − rt.k| ≤ ε

G2. Let m_j be a message sent by process j to k. Also, let st_m denote the
clock value of j when j sent m_j, and let rd_m denote the clock value of j
when k received m_j. We require that k should receive m_j within time δ
unless m_j is lost. In other words,

    ∀m_j :: ((rd_m ≤ (st_m + δ)) ∨ rd_m = ∞)
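
As a small illustration, the bound in G1 and the condition in G2 can be written as predicates. The function names are hypothetical, a lost message is represented by math.inf, and the comparisons are taken to be non-strict, which is an assumption of this sketch.

```python
import math


def satisfies_g1(clock_values, eps) -> bool:
    """Bound part of G1: any two clock values differ by at most eps."""
    return max(clock_values) - min(clock_values) <= eps


def satisfies_g2(st_m, rd_m, delta) -> bool:
    """G2 for one message: received within delta of sending, or lost."""
    return rd_m == math.inf or rd_m <= st_m + delta
```

The monotonicity requirement of G1 (rt.j is non-decreasing) concerns a whole execution rather than a single snapshot, so it is not covered by satisfies_g1.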

Notation. A distributed system instantiated with parameters ε and δ is denoted
as ds(ε, δ).

Remark. We assume that time is discrete and the value of ε is an integer; ε
equals ceiling(max clock drift / minimum time difference between events). The
issue of fine-tuning ε, δ is discussed later in the paper.

Execution of a process consists of a sequence of events; an event can be a 
local event, a send event, or a receive event. In a local event, the process neither 
receives a message nor sends a message. In a send event, the process sends one 
or more messages, and in a receive event, the process receives one or more mes- 
sages. For simplicity, we assume that, for one clock tick of j, at most one event is 
created at process j. (Note that this assumption can be easily weakened so that 
at most K events are generated for each clock tick, where K is any constant.) 

Notation. In this paper, we use i, j, k, and l to denote processes. We use e and
f to denote events. Where needed, events are subscripted with the process at
which they occur; thus, e_j is an event at j. We use m to denote messages.
Messages are subscripted by the sender process; thus, m_j is a message sent by j.

Fault Model. In our fault-model, messages can be corrupted, lost, or duplicated. 
Moreover, processes could be improperly initialized, and channels may contain 
garbage messages in the initial state. The state of processes could be transiently 
(and arbitrarily) corrupted at any time. Also, the guarantees made by the system 
(G1 and G2) may be temporarily violated. We assume that the number of fault 
occurrences is finite. Our solutions are stabilizing fault-tolerant in the presence 
of these faults, where 

Definition. A program is stabilizing fault-tolerant iff starting from an arbitrary 
state, it eventually recovers to a state from where its specification is satisfied. 

3 Logical Timestamps 

In this section, we present our bounded-state, stabilizing solution for logical
timestamps. Towards this end, we first precisely define the problem of logical
timestamps in Section 3.1. Then, we provide an implementation of the logical
timestamps in Section 3.2. Finally, we show how that implementation can be
made stabilizing.

3.1 Problem Statement 

Towards defining the problem of logical timestamps, we first define the causal
relation 0, →, among events. Then we introduce a definition which will be
used in the problem statement.

Happened-Before. The happened-before relation → is the smallest transitive
relation that satisfies, for any two events e, f, e → f if (1) e and f are events
on the same process and e occurred before f, or (2) e is a send event in one
process and f is the corresponding receive event in another process. □

Notation. Let ts be a type, and let ts.e and ts.f be values of type ts. Then,
less(ts, ts) is a function that takes two arguments of type ts and returns a
boolean.

Definition. Let ts be a type. A function less(ts, ts) is well-formed iff it is
irreflexive and asymmetric.

The problem of logical timestamps can now be defined as follows. 

Specification of Logical Timestamps. Identify a type ts, a well-formed
relation less(ts, ts), and assign a timestamp of type ts to each event in the
given program computation such that for any two events e and f with
timestamps ts.e and ts.f the following condition is true:

    • e → f ⇒ less(ts.e, ts.f) □

Remark. In Lamport’s scalar clock implementation, ts is instantiated to be
integer, and less(ts.e, ts.f) iff ts.e < ts.f. The vector clock implementation
by Fidge jH] and Mattern jTj can also be viewed as solving the logical
timestamp problem provided we instantiate ts to be an array of integers, and
define less(ts.e, ts.f) to be true iff
((∀k :: ts.e[k] ≤ ts.f[k]) ∧ (∃k :: ts.e[k] < ts.f[k])).
Note that less is not a total relation in that for events e and f, it is possible
that both less(ts.e, ts.f) and less(ts.f, ts.e) are false. However, both of them
cannot be true.
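
The two instantiations of less mentioned in the remark can be sketched in Python as follows; the function names are chosen for the example only.

```python
def less_scalar(ts_e: int, ts_f: int) -> bool:
    """less for scalar clocks: plain integer comparison."""
    return ts_e < ts_f


def less_vector(ts_e, ts_f) -> bool:
    """less for vector clocks: no component larger, at least one strictly smaller."""
    no_larger = all(x <= y for x, y in zip(ts_e, ts_f))
    some_smaller = any(x < y for x, y in zip(ts_e, ts_f))
    return no_larger and some_smaller
```

For the vectors [1, 0] and [0, 1] neither direction of less_vector holds, which is the sense in which less need not be total.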

3.2 Solution to Logical Timestamps 

We propose that the timestamp of event e_j be of the form ⟨r.e_j, c.e_j, kn.e_j⟩,
where r.e_j denotes the physical clock value of j when e_j was created. The
variable c.e_j is used to capture the knowledge that j had about the maximum
clock value in the system when e_j was created. Specifically, c.e_j equals the
difference between the maximum clock value in the system that j is aware of when
e_j was created and r.e_j. The variable kn.e_j is an array of size 2ε. The
variable kn.e_j[t], −ε ≤ t < ε, is used to capture the knowledge about the
number of events f such that r.f = r.e_j + t and f → e_j. (We maintain kn.j[t]
only for t < ε because j cannot learn of events whose timestamp is at least
rt.j + ε. To see this, observe that when rt.j = x, the maximum clock value in
the system is x + ε. However, any message sent with timestamp x + ε can be
received at j only when rt.j is incremented.)

Our logical timestamp program is as follows: Each process j maintains rt.j, 
r.j, c.j and the array kn.j. The variable rt.j is the physical time at j, and 
(r.j,c.j,kn.j) is the timestamp of the last event on j. We assume that the first 
event is created on each process when its physical clock value is 0. Hence, we 
initialize rt.j, r.j, c.j to be 0. Also, we initialize kn.j[0] to be equal to 1 and all 
other elements in kn.j to be 0. The variable rt.j is updated by the underlying 
system that ensures that G1 is satisfied. The logical timestamp protocol can 
only read rt.j. 

We ensure that r.j + c.j equals the knowledge that j has about the maximum
clock value in the system. Consider the case where j creates a local event e_j at
time rt.j. Note that r.j + c.j equals the maximum clock value that j was aware
of when it created the last event. If r.j + c.j > rt.j, then r.j + c.j is
still the maximum clock value that j is aware of. In that case, we set c.j to be
equal to r.j + c.j − rt.j. If r.j + c.j < rt.j, then the maximum clock
value that j is aware of is the same as rt.j. Hence, we set c.j to 0. We update
kn based on the previous value of kn.j. Then, we increase kn.j[0] to capture
the fact that j is aware of one extra event at time rt.j + 0. In the send event,
r.e_j, c.e_j and kn.e_j are updated in the same way and the message carries the
timestamp ⟨r.e_j, c.e_j, kn.e_j⟩. When j receives message(s), it updates r, c
and kn in the same way except that it does the update based on the previous
event at j and the event(s) corresponding to the sending of (those) message(s).
Thus, the program is as shown in Figure 1. (While presenting the program, for
simplicity of presentation, we assume that kn.j[t] equals 0 if t < −ε or t ≥ ε.)



Initially: 
    rt.j, r.j, c.j = 0;  ∀t : t ≠ 0 : kn.j[t] = 0;  kn.j[0] = 1 

Local event e_j / Send event e_j (message being sent is m_j) 
    c.j := max(0, r.j + c.j − rt.j) 
    ∀t : −ε ≤ t < ε : kn.j[t] := kn.j[t + rt.j − r.j] 
    kn.j[0] := kn.j[0] + 1 
    r.j := rt.j 
    r.e_j, c.e_j, kn.e_j := r.j, c.j, kn.j 
    if e_j is a send event then r.m_j, c.m_j, kn.m_j := r.j, c.j, kn.j 

Receive event e_j (message m received with timestamp (r.m, c.m, kn.m)) 
    c.j := max(0, r.j + c.j − rt.j, r.m + c.m − rt.j) 
    ∀t : −ε ≤ t < ε : kn.j[t] := max(0, kn.j[t + rt.j − r.j], kn.m[t + rt.j − r.m]) 
    kn.j[0] := kn.j[0] + 1 
    r.j := rt.j 
    r.e_j, c.e_j, kn.e_j := r.j, c.j, kn.j 

Fig. 1. Logical Timestamp Program 
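
The update rules of Fig. 1 can be traced with a small Python sketch. It is only an illustration under stated assumptions: the class name Process, the method names, the sparse dictionary used for kn (missing offsets count as 0) and the value EPS = 3 for the clock bound are all choices made here, not part of the program above, and the physical clock rt is passed in by the caller rather than maintained by an underlying system.

```python
EPS = 3  # assumed clock bound (epsilon in G1); illustrative value only


class Process:
    """Sketch of one process j running the program of Fig. 1."""

    def __init__(self):
        self.r = 0          # physical clock value of the last event
        self.c = 0          # maximum known clock value minus r
        self.kn = {0: 1}    # offset t -> number of known events; missing = 0

    @staticmethod
    def _shifted(kn, r_old, rt):
        # Re-index kn so that offset t refers to physical time rt + t.
        # Offsets that fall outside the stored window read as 0.
        d = rt - r_old
        return {t: kn.get(t + d, 0) for t in range(-EPS, EPS)}

    def local_or_send(self, rt):
        """Local or send event at physical time rt; returns the timestamp."""
        self.c = max(0, self.r + self.c - rt)
        self.kn = self._shifted(self.kn, self.r, rt)
        self.kn[0] += 1
        self.r = rt
        return (self.r, self.c, dict(self.kn))

    def receive(self, rt, msg):
        """Receive event at physical time rt for a message timestamp msg."""
        r_m, c_m, kn_m = msg
        self.c = max(0, self.r + self.c - rt, r_m + c_m - rt)
        mine = self._shifted(self.kn, self.r, rt)
        theirs = self._shifted(kn_m, r_m, rt)
        self.kn = {t: max(mine[t], theirs[t]) for t in range(-EPS, EPS)}
        self.kn[0] += 1
        self.r = rt
        return (self.r, self.c, dict(self.kn))


# Usage: p sends when its clock reads 2; q receives when its clock reads 1.
p, q = Process(), Process()
m = p.local_or_send(rt=2)
e = q.receive(rt=1, msg=m)
```

In this run the receive event gets r = 1 and c = 1, so r + c equals the sender's clock value 2, and kn records one known event at each of the times 0, 1 and 2.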



For the program in Figure 1 the following lemmas are true in the absence of 
faults. 

Lemma 3.1. 
    ∀e :: kn.e[c.e] > 0 
    ∀m :: kn.m[c.m] > 0 
    ∀e, t : c.e < t < ε : kn.e[t] = 0 
    ∀m, t : c.m < t < ε : kn.m[t] = 0 □ 

Lemma 3.2. ∀e :: c.e < ε. □ 

Lemma 3.3. ∀e, t : −ε ≤ t < ε : kn.e[t] ≤ |{e} ∪ {f : f → e ∧ r.f = r.e + t}|. □ 

Lemma 3.4. Given a system with n processes, ∀e, t : −ε ≤ t < ε : kn.e[t] ≤ n. □ 

Remark. We have assumed that the minimum delay in message transmission is 
0 and, hence, messages could be received instantaneously. However, if the 
underlying system ensures that there is a minimum delay δ_min in transmission 
of messages, we need to maintain kn.j[t] only for −(ε − δ_min) ≤ t < (ε − δ_min). 
In this case, the number of elements in the array kn.j is reduced to 2(ε − δ_min). 
Finally, when δ_min ≥ ε, there is no need to maintain the array. (We would like to 
point out that Lamport had made the observation for the case where δ_min ≥ ε 
in [13].) 

Comparing Timestamps. Given two events e_j and f_k with timestamps 
(r.e_j, c.e_j, kn.e_j) and (r.f_k, c.f_k, kn.f_k), we first use the r and c values to 
determine the truth value of less((r.e_j, c.e_j, kn.e_j), (r.f_k, c.f_k, kn.f_k)). As 
mentioned above, r.e_j + c.e_j captures the maximum clock value that j was aware 
of when e_j was created. Thus, if r.e_j + c.e_j < r.f_k + c.f_k, we can safely decide 
less((r.e_j, c.e_j, kn.e_j), (r.f_k, c.f_k, kn.f_k)) to be true. Finally, if 
r.e_j + c.e_j = r.f_k + c.f_k then we use the array kn to determine the truth value 
of less((r.e_j, c.e_j, kn.e_j), (r.f_k, c.f_k, kn.f_k)). Towards this end, we use a 
lexicographical comparison. First, we compare kn.e[c.e] with kn.f[c.f]. If these 
two values are unequal then they determine the truth value of 
less((r.e_j, c.e_j, kn.e_j), (r.f_k, c.f_k, kn.f_k)). If kn.e[c.e] equals kn.f[c.f] then 
we compare kn.e[(c.e) − 1] and kn.f[(c.f) − 1], and so on. Moreover, as shown in 
Theorem 3.6, comparing only ε elements in kn is sufficient. Thus, we define the 
relation less as follows: 

less((r.e, c.e, kn.e), (r.f, c.f, kn.f)) 
iff 
    (r.e + c.e < r.f + c.f) ∨ 
    ((r.e + c.e = r.f + c.f) ∧ lexgraph(kn.e, c.e, kn.f, c.f, ε)) 
where 
    lexgraph(kn1, c1, kn2, c2, n) = if (n = 0) then false 
                                    elseif kn1[c1] < kn2[c2] then true 
                                    elseif kn1[c1] > kn2[c2] then false 
                                    else lexgraph(kn1, c1 − 1, kn2, c2 − 1, n − 1) 

If for two events e_j and f_k, ¬less((r.e_j, c.e_j, kn.e_j), (r.f_k, c.f_k, kn.f_k)) and 
¬less((r.f_k, c.f_k, kn.f_k), (r.e_j, c.e_j, kn.e_j)), then the events are ordered based 
on their process ID. 
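
The comparison just described can be written out as a short Python sketch. It is a minimal illustration under assumptions: timestamps are tuples (r, c, kn) with kn a dictionary from offset to count (missing offsets read as 0), the clock bound is passed as the parameter eps, the recursion of lexgraph is unrolled into a loop, and the helper before, including the choice that the smaller process ID goes first, is hypothetical.

```python
def lexgraph(kn1, c1, kn2, c2, n):
    """Compare n entries, walking down from kn1[c1] and kn2[c2].

    Returns True if the first differing pair has the kn1 entry smaller,
    False if it has the kn1 entry larger or if all n pairs are equal.
    """
    for i in range(n):
        a = kn1.get(c1 - i, 0)
        b = kn2.get(c2 - i, 0)
        if a != b:
            return a < b
    return False


def less(ts_e, ts_f, eps):
    """The less relation on timestamps (r, c, kn)."""
    r_e, c_e, kn_e = ts_e
    r_f, c_f, kn_f = ts_f
    if r_e + c_e != r_f + c_f:
        return r_e + c_e < r_f + c_f
    return lexgraph(kn_e, c_e, kn_f, c_f, eps)


def before(ts_e, pid_e, ts_f, pid_f, eps):
    """Total order: less first, process ID as tie-breaker (assumed: smaller first)."""
    if less(ts_e, ts_f, eps):
        return True
    if less(ts_f, ts_e, eps):
        return False
    return pid_e < pid_f


# Usage: a send event at clock value 2 and the matching receive event at
# clock value 1 on a process whose clock is behind (eps = 3).
send_ts = (2, 0, {-2: 1, 0: 1})
recv_ts = (1, 1, {-1: 1, 0: 1, 1: 1})
ordered = less(send_ts, recv_ts, 3)
```

Here both timestamps have r + c = 2, so the decision falls to the kn arrays: the entries for time 2 are equal, and the entry for time 1 is 0 at the send event and 1 at the receive event, which makes less true for the pair (send, receive) and false for the reverse.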

Note that kn values are compared only when r.e + c.e equals r.f + c.f. Thus, 
kn.f[c.f] is compared with kn.e[r.f + c.f − r.e] (= kn.e[c.e]). More generally, kn.f[t] 
is compared with kn.e[r.f + t − r.e]. This is due to the fact that kn.f[t] denotes 
the knowledge about events at r.f + t. Likewise, kn.e[r.f + t − r.e] denotes the 
knowledge about events at r.e + r.f + t − r.e (= r.f + t). 

For the above relation, we first make the following observation and then prove 
that the program in Figure 1 satisfies the specification of logical timestamps. 

Observation 3.5. The less relation given above is transitive and well-formed. □ 

Theorem 3.6. ∀e, f :: e → f ⇒ less((r.e, c.e, kn.e), (r.f, c.f, kn.f)) 

Proof. We prove this by induction. Initially, for any two events e, f, e → f is 
false. Hence, the theorem is trivially true. Now, consider the case where a new 
event, say f, is created at some process, say j. 

1. f is a send/local event: 

   Let e be the event that occurred at j just before f. From the assignment to 
   c.f, we have: r.e + c.e ≤ r.f + c.f. If r.e + c.e < r.f + c.f then it follows that 
   less((r.e, c.e, kn.e), (r.f, c.f, kn.f)) is true. Hence, we consider the case where 
   (a) r.e + c.e = r.f + c.f. In this case, based on how the program updates kn.f, 
   we have (b) ∀t : −ε ≤ t < ε ∧ −ε ≤ t + r.f − r.e < ε : kn.e[t + r.f − r.e] ≤ kn.f[t]. 
   Also, since kn.f[0] is increased by the program, we have 
   (c) kn.e[r.f − r.e] < kn.f[0]. 

   In evaluating less((r.e, c.e, kn.e), (r.f, c.f, kn.f)), we first compare kn.f[c.f] 
   with kn.e[c.e]. Then, we compare kn.f[c.f − 1] with kn.e[c.e − 1], and so on. 
   When we have compared c.f + 1 elements, the truth value of less will be 
   determined since kn.f[c.f − c.f] is greater than kn.e[c.e − c.f] (= kn.e[r.f − r.e]). 
   If the truth value of less((r.e, c.e, kn.e), (r.f, c.f, kn.f)) is determined before 
   we compare kn.f[c.f − c.f] with kn.e[c.e − c.f], from (b), 
   less((r.e, c.e, kn.e), (r.f, c.f, kn.f)) must be true. Moreover, since c.f is less 
   than ε, at most ε elements in kn are compared. Finally, from (a), (b) and (c), 
   it follows that less((r.e, c.e, kn.e), (r.f, c.f, kn.f)) is true. Now for any event 
   a, a ≠ e, a → e iff a → f. Moreover, if a → e then 
   less((r.a, c.a, kn.a), (r.e, c.e, kn.e)) is true by induction. Hence by 
   transitivity, less((r.a, c.a, kn.a), (r.f, c.f, kn.f)) is also true. 

2. f is a receive event. 

   This proof is similar to case 1 except that we need to consider two events: 
   e_1, the event on j just before f, and e_2, the corresponding send event. As 
   in the previous case, we show that less((r.e_1, c.e_1, kn.e_1), (r.f, c.f, kn.f)) 
   and less((r.e_2, c.e_2, kn.e_2), (r.f, c.f, kn.f)) are true. 

   Once again, for any event a, (a ≠ e_1 ∧ a ≠ e_2), a → f iff (a → e_1 ∨ a → e_2). 
   Hence, by transitivity, less((r.a, c.a, kn.a), (r.f, c.f, kn.f)). □ 

Bounding the State Space of Logical Timestamps. From Lemmas 3.2 
and 3.4, it follows that the c and kn values are bounded. Also, rt values can 
be bounded by letting j maintain only brt.j = (rt.j mod B) where B is a large 
enough constant. More specifically, B should be large enough so that when j 
receives a message m that carries brt.m (instead of rt.m), j should be able to 
update its logical timestamp appropriately. When j receives message m_l from l, 
from G1, rt.l is in the range [rt.j − ε .. rt.j + ε]. From G2, when m_l was sent, rt.l 
was in the range [rt.j − ε − δ .. rt.j + ε]. Hence, the vector kn.m_l carries 
information about events that occurred at time [rt.j − ε − δ − ε .. rt.j + ε + ε − 1]. 
It follows that if B is greater than or equal to 4ε + δ + 1, j would be able to 
update the logical timestamp appropriately. Hence, to bound the space used by 
j, we maintain brt.j = rt.j mod B where B ≥ 4ε + δ + 1. 

3.3 Bounding and Stabilizing Logical Timestamps 

If the logical timestamps are perturbed by faults such as failure and repair of 
process, corrupted timestamps, temporary violation of system guarantees, we en- 
sure that the program reaches a state from where causality is tracked correctly. 
The stabilization proceeds in four steps. First, if the system guarantees are vi- 
olated then they are restored. Several approaches may suffice for this purpose, 
including clock synchronization programs, GPS clocks, network time protocol, 
etc. The particular approach used is not relevant for our purpose; any approach 
may be used. 

In the second step, we satisfy the local properties in Lemmas 3.1, 3.2 and 3.4. 
Towards this end, for any process j, if c.j is ever assigned a value that is greater 
than or equal to ε, kn.j[t] is assigned a value that is greater than n, kn.j[c.j] is 
zero or for some t, t > c.j, kn.j[t] is nonzero, we set c.j to 0, kn.j[0] to 1 and all 
other elements of kn.j to 0. Likewise, whenever a message is received we ensure 
that Lemmas 3.1, 3.2 and 3.4 are satisfied for that message before processing 
that message. 
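
The local check of the second step can be sketched in Python as follows. This is an illustration under assumptions, not the authors' code: the function name sanitize is hypothetical, kn is a dictionary from offset to count with missing offsets read as 0, the clock bound and the number of processes are passed as eps and n, and the sketch additionally treats negative values of c or kn as violations, which the text does not state explicitly.

```python
def sanitize(c, kn, eps, n):
    """Return (c, kn) if the local invariants hold, else the reset state."""
    ok = (
        0 <= c < eps                                   # c is below the clock bound
        and all(0 <= kn.get(t, 0) <= n                 # no entry exceeds n
                for t in range(-eps, eps))
        and kn.get(c, 0) > 0                           # kn[c] is nonzero
        and all(kn.get(t, 0) == 0                      # nothing is known beyond c
                for t in range(c + 1, eps))
    )
    if ok:
        return c, kn
    # Reset: c = 0, kn[0] = 1 and every other element of kn is 0.
    return 0, {0: 1}


# Usage: a consistent state is kept, a corrupted one is reset.
kept = sanitize(1, {0: 1, 1: 1}, eps=3, n=4)
reset = sanitize(3, {0: 1, 1: 1}, eps=3, n=4)
```

The same function can be applied to the (c, kn) part of an incoming message timestamp before the receive event is processed.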

In the third step, messages with corrupt timestamps are either delivered or 
are lost. Consider a message m whose timestamp is corrupted after it was sent. 
After δ time has passed, m is delivered or lost (from G2). Since reception of m 
does not affect rt values, G1 continues to be true. Also, if r, c, kn values are 
changed in a way that local invariants are violated, step 2 resets them. Thus, the 
correction achieved by steps 1 and 2 is not affected and the number of messages 
whose timestamp is corrupted while in transit is reduced by 1. It follows that 
eventually, for any message m_k in transit, the timestamp of m_k is the same as 
the timestamp of k when m_k was sent. 

For the fourth step, consider a state, say s, obtained after steps 1-3. Let 
rt.j = x in state s. Thus, from G1, we find that for any process k, rt.k is in the 
range [x − ε .. x + ε]. Moreover, from G2, for any message, say m_k, in transit, 
m_k was sent when rt.k was in the range [x − ε − δ .. x + ε]. Hence, m_k can carry 
information about events whose physical time is in the range 
[x − ε − δ − ε .. x + ε + ε − 1]. It follows that no process (or message) has 
knowledge about events that occur at physical time t where t ≥ x + 2ε. In other 
words, if j acquires knowledge about events that occur at time t, t ≥ x + 2ε, it 
must be due to actual events rather than due to faults such as transients. 
However, the knowledge about events in the range [x .. x + 2ε − 1] may be 
incorrect. Now, if rt.j is unbounded and rt.j is advanced to x + 3ε, the kn.j 
vector will only maintain information about events whose physical time is in the 
range [x + 2ε .. x + 4ε − 1] and, as discussed above, kn.j will be consistent. After 
all kn values become consistent, causality will be tracked correctly. 

Also, note that the time required for steps 2 and 3 is δ. And, the time required 
for step 4 is 3ε. Thus, the time required for recovery is O(ε + δ). 

Once again, we maintain brt.j = (rt.j mod B) where B is a large enough 
constant. For stabilization, we increase the value of B so that j can distinguish 
between the different rt values that may occur during the computation where 
rt.j is advanced from x to x + 3ε. From the above discussion, the rt values 
encountered in that computation are in the range [x − 2ε − δ .. x + 4ε − 1]. Hence, 
we maintain brt.j = rt.j mod B where B ≥ 6ε + δ + 1. Thus, we have: 

Theorem 3.7. If the timestamping program in Figure 1 is modified so that the 
conditions specified in Lemmas 3.1, 3.2 and 3.4 are satisfied (as given in the 
second step above) and a variable brt.j = rt.j mod B is maintained to capture rt.j 
then the resulting program uses bounded space and is stabilizing fault-tolerant 
provided B ≥ 6ε + δ + 1. □ 

Now, we consider the state space required to implement our logical 
timestamps. As discussed above, the maximum value for c and brt is O(ε + δ). 
Thus, the space required for those variables is O(log(ε + δ)). Each element in kn 
is bounded by n and, hence, requires O(log n) space. It follows that the maximum 
space required for kn is O(ε log n). Summing up the space requirement for brt, c 
and kn, we have 

Theorem 3.8. The space used by the stabilizing logical timestamp program 
is O(ε log n + log δ). □ 

4 Causal Deterministic Merge 

In this section, we use the timestamp introduced in Section 3 to present our 
program for causal deterministic merge. We first define the problem in Section 
4.1. Then, we give the bounded-space stabilizing solution in Section 4.2. 



4.1 Problem Statement 

The problem of causal deterministic merge requires that causal delivery is sat- 
isfied and that the messages are merged deterministically. Thus, whenever a 
message is received, it should be buffered until the receiving process determines 
the place where the message should be put while obtaining the causal determin- 
istic merge. Whenever the appropriate place in the causal deterministic merge 
is determined, we deliver that message. Thus, the problem requires that the 
following two specifications are satisfied. 

Note that the above problem statement requires causal deterministic merge 
of messages that reach the destination within δ time. In other words, messages 
lost in transit are ignored. Such a delivery pattern is useful for applications, 
e.g., audio/video streams, where such message loss can be easily tolerated. 
Moreover, many such applications require that most of the messages be 
delivered in a timely fashion even though some messages are never delivered. In 
other words, they require that even if a message is lost, the messages that 
causally depend on it should not suffer excessive delay. Based on the application 
requirements and its ability to tolerate message loss, it is possible (cf. Section 
6) to fine-tune the value of δ: increasing the value of δ decreases the number of 
messages lost but increases the maximum delay that messages may incur and the 
amount of buffer required to obtain causal deterministic merge. 

Specification of causal delivery. 

For messages m_1 and m_2 and for process j that satisfy 
    send(m_1) → send(m_2), 
    Process j is included in the destination set of m_1 and m_2, and 
    m_1 and m_2 are received at j 
the following two conditions are satisfied: 
    m_1 is delivered before m_2, and 
    m_1 and m_2 are eventually delivered 

Specification of deterministic merge. 

For messages m_1 and m_2 and processes j and k that satisfy 
    Processes j and k are included in the destination set of m_1 and m_2, and 
    Both m_1 and m_2 are received at j and k 
the following condition is satisfied: 
    The order in which m_1 and m_2 are delivered at j and k is the same, and 
    m_1 and m_2 are eventually delivered 



4.2 Solution 

We first present the condition for causal ordering of two messages, m_1 and m_2, 
received at process j such that m_1 → m_2. Let us assume that m_2 is received 
first at j. We now find how long m_2 should wait at j before being delivered so 
that m_1 is received and delivered before that. From the timestamp comparison, 
we have m_1 → m_2 ⇒ r.m_1 + c.m_1 ≤ r.m_2 + c.m_2. Moreover, since c.m_1 ≥ 0, it 
follows that r.m_1 ≤ r.m_2 + c.m_2. In other words, when m_1 was sent the physical 
clock value of the sender process was at most r.m_2 + c.m_2. From G2, when m_1 is 
received at j, the clock value of the sender (of m_1) will be at most 
r.m_2 + c.m_2 + δ. Moreover, from G1, the clock value of j will be at most 
r.m_2 + c.m_2 + δ + ε. So, m_1 will reach j before rt.j equals r.m_2 + c.m_2 + δ + ε 
or m_1 will be lost. Hence, to obtain causal delivery, it suffices that m_2 wait 
until rt.j equals r.m_2 + c.m_2 + δ + ε. Thus, the delivery condition for message 
m is delcond(m, j) where 

    delcond(m, j) = (rt.j = r.m + c.m + δ + ε) 



Causal Delivery Program. In our causal delivery program, whenever 
message m is received at process j, m is buffered until delcond(m, j) is satisfied. 
As soon as delcond(m, j) is satisfied, message m is delivered. If multiple 
messages satisfy the delivery condition simultaneously, j determines the causal 
relation (if any) between them using the less relation: given two messages m_k 
and m_l, if less((r.m_k, c.m_k, kn.m_k), (r.m_l, c.m_l, kn.m_l)) is true we deliver 
m_k before m_l. If both less((r.m_k, c.m_k, kn.m_k), (r.m_l, c.m_l, kn.m_l)) and 
less((r.m_l, c.m_l, kn.m_l), (r.m_k, c.m_k, kn.m_k)) are false then we deliver them 
based on their process ID. Observe that there is only one way to deliver these 
messages. 
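
The buffering rule can be illustrated with a short Python sketch. Everything named here is an assumption made for the illustration: the class DeliveryBuffer, the representation of a message as a dictionary with keys r, c and pid, and the stand-in ordering simple_before, which compares only r + c and the process ID and ignores the kn array. The sketch also releases a message when rt has reached or passed its delivery time, a small deviation from the equality in delcond so that a message that arrives late is not held forever.

```python
from functools import cmp_to_key


class DeliveryBuffer:
    """Holds received messages until their delivery condition is met."""

    def __init__(self, eps, delta, before):
        self.eps = eps
        self.delta = delta
        self.before = before   # before(a, b): True if a is delivered first
        self.pending = []      # (delivery time, message) pairs

    def receive(self, msg):
        # Delivery time from delcond: r.m + c.m + delta + eps.
        due = msg["r"] + msg["c"] + self.delta + self.eps
        self.pending.append((due, msg))

    def tick(self, rt):
        """Call when the local clock reads rt; returns messages to deliver."""
        ready = [m for due, m in self.pending if due <= rt]
        self.pending = [(due, m) for due, m in self.pending if due > rt]

        def cmp(a, b):
            if self.before(a, b):
                return -1
            if self.before(b, a):
                return 1
            return 0

        return sorted(ready, key=cmp_to_key(cmp))


def simple_before(a, b):
    # Simplified stand-in for less plus the process-ID tie-breaker.
    return (a["r"] + a["c"], a["pid"]) < (b["r"] + b["c"], b["pid"])


# Usage: m2 arrives before m1, yet m1 is delivered first.
buf = DeliveryBuffer(eps=1, delta=2, before=simple_before)
m1 = {"r": 4, "c": 0, "pid": 2}
m2 = {"r": 5, "c": 0, "pid": 1}
buf.receive(m2)
buf.receive(m1)
first = buf.tick(7)
second = buf.tick(8)
```

With eps = 1 and delta = 2, m1 becomes deliverable at clock value 7 and m2 at clock value 8, regardless of the order in which they were received.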

From Lemma 3.2, the value of c.m is at most ε. Thus, message m will be 
delivered before rt.j reaches r.m + δ + 2ε. From G1, the clock value of the sender 
when m is delivered is at most r.m + δ + 3ε. Also, if j receives m when rt.j is x 
then r.m is at most x + ε. Thus, m can be buffered only for time ε + δ + ε + c.m. 
It follows that no message is buffered for longer than δ + 3ε. Thus, the following 
lemmas and theorem are true. 

Lemma 4.1. If j sends message m when its physical clock value was r.m then 
it would be delivered before the physical clock value of j reaches r.m + δ + 3ε. □ 

Lemma 4.2. A process buffers a message for at most δ + 3ε time. □ 

Theorem 4.3. Given a system ds(ε, δ), if our causal delivery program is used to 
deliver messages then the resulting system will be ds(ε, δ + 3ε) (cf. Figure 2). □ 



[Figure: the underlying system ds(ε, δ), together with the buffer, forms the 
system with causal deterministic merge, ds(ε, δ + 3ε).] 
Fig. 2. Guarantees of Our Causal Delivery Program 



Now, we show that the causal delivery program given above suffices for ensuring 
causal deterministic merge in a publish-subscribe system. Formally, 

Theorem 4.4. If two messages m_1 and m_2 arrive at processes j and k then the 
order in which they are delivered would be the same on both processes. □ 

Theorem 4.5. If two messages m_1 and m_2 such that send(m_1) → send(m_2) 
arrive at any process j then m_1 is delivered before m_2. □ 

The stabilization of causal deterministic merge is similar to that of the logical 
timestamps. Note that even if the logical timestamps of messages are corrupted, 
the causal delivery program never deadlocks since every message is eventually 
considered for delivery and when multiple messages are considered for delivery 
simultaneously the less relation (along with the tie-breaker based on the pro- 
cess ID) determines the order in which they should be delivered. Once the logical 
timestamps are restored to their legitimate states, the subsequent computation 
will satisfy the requirements of causal deterministic merge. Thus, we have 

Theorem 4.6. The causal delivery program given above is stabilizing 
fault-tolerant. □ 

5 Extensions 

In this section, we discuss extensions to our logical timestamps presented in 
Section 3 such that the array kn.j is eliminated. This is achieved at the cost of 
maintaining an unbounded value of c.j. Also, for a standard less relation, we show 
that the c.j value cannot be bounded. Subsequently, we show that if the application 
provides certain guarantees about creation of events in the system, then c.j can 
also be bounded. In both cases, the timestamp is of the form (r.j, c.j). Each 
of these implementations can be used to solve stabilizing causal deterministic 
merge. 



5.1 Eliminating kn.j at the Cost of Unbounded c Value 

In this case, we modify our logical timestamp to be of the form (r.j, c.j) where 
r.j is still the physical clock value of the last event at j. The values of r.j and 
c.j are updated as shown in Figure 3. 

Initially: 
    rt.j, r.j, c.j = 0, 0, 0 

Local event e_j / Send event e_j (message being sent is m_j) 
    if r.j + 2ε < rt.j then c.j := 0 
    else c.j := max(0, r.j + c.j − rt.j) 
    r.j := rt.j 
    r.e_j, c.e_j, r.m_j, c.m_j := r.j, c.j, r.j, c.j 

Receive event e_j (message m received with timestamp (r.m, c.m)) 
    if (r.m + 2ε < rt.j ∧ r.j + 2ε < rt.j) then c.j := 0 
    else c.j := max(0, r.j + c.j − rt.j, r.m + c.m − rt.j + 1) 
    r.j := rt.j 
    r.e_j, c.e_j := r.j, c.j 

Fig. 3. Revised Logical Timestamp Program 

In the program in Figure 3 we first consider the case where the previous 
event was more than 2ε time apart. In that case, we reset c.j to 0. Otherwise, 
for the send event the program remains unchanged. For the receive event, if 
r.m + 2ε ≥ r.f then we ensure that r.m + c.m < r.f + c.f where f is the event 
that corresponds to the receipt of message m. (In the previous program, we had 
permitted r.m + c.m ≤ r.f + c.f; this allowed us to bound the c value at the cost 
of maintaining the array kn.) 

For the program in Figure 3 we define the less(ts, ts) relation for our 
timestamp of the form (r.j, c.j) as follows: Given two events e and f, 

less((r.e, c.e), (r.f, c.f)) 
iff 
    (r.e + ε < r.f) ∨ 
    ((|r.e − r.f| ≤ ε) ∧ ((r.e + c.e < r.f + c.f) ∨ ((r.e + c.e = r.f + c.f) ∧ r.e < r.f))) 
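
As a reading aid, the relation above can be transcribed into Python. This is a sketch under assumptions: the function name less2 and the parameter eps (the clock bound) are introduced here, and the inner disjunction is expressed as a lexicographic comparison of the pairs (r + c, r), which is equivalent to the two clauses in the definition.

```python
def less2(e, f, eps):
    """less on timestamps of the form (r, c), without the kn array."""
    r_e, c_e = e
    r_f, c_f = f
    if r_e + eps < r_f:
        # e is so far in the past that it can be ordered on r alone.
        return True
    if abs(r_e - r_f) > eps:
        # f is far in the past of e.
        return False
    # Clock values within eps: compare r + c, then r.
    return (r_e + c_e, r_e) < (r_f + c_f, r_f)


# Usage with eps = 5: a send stamped (5, 0) and its receive stamped (1, 5).
forward = less2((5, 0), (1, 5), 5)
backward = less2((1, 5), (5, 0), 5)
```

In this example the receive event has the smaller r value but the larger r + c, so the relation orders the send before the receive and not the other way round.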

We note that stabilization can be added to the program in Figure 3 as discussed 
in Section 3.3. 

Proof for Unboundedness. Now, we show that the use of the above less 
relation requires that the value of c.j is unbounded even in a system with only 
3 processes, say j, k and l. We begin in a state where rt.j = ε, rt.k = 0, rt.l = 0 
and the c values for all the processes are 0. Now, let j send a message to k. 
This message is sent with the timestamp (ε, 0). Let this message be received 
when rt.k = 1. The receive event at k has the timestamp (1, ε). Now, let k send 
a message to l when rt.k equals 2. This message is sent with the timestamp 
(2, ε − 1). Let this message be received at l when rt.l = 1. The receive event at l 
has the timestamp (1, ε + 1). Now, let l send a message to j when rt.l equals 2. 
This message is sent with the timestamp (2, ε). Let this message be received at 
j when rt.j = ε + 1. The receive event at j has the timestamp (ε + 1, 2). Now, we 
repeat the scenario with j sending a message with timestamp (ε + 2, 1) as shown 
in Figure 4. It is straightforward to verify that the c value is unbounded. 
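
The scenario can be replayed mechanically. The following Python sketch applies the update rules of Fig. 3 to the message pattern described above and records the timestamp of every event. The names Proc, run and max_c, the value EPS = 5 for the clock bound, and the way the physical clocks are advanced by one between events are assumptions of this illustration.

```python
from dataclasses import dataclass

EPS = 5  # assumed clock bound (epsilon); illustrative value only


@dataclass
class Proc:
    rt: int = 0  # physical clock, advanced by the scenario below
    r: int = 0   # physical clock value of the last event
    c: int = 0   # maximum known clock value minus r

    def send(self):
        if self.r + 2 * EPS < self.rt:
            self.c = 0
        else:
            self.c = max(0, self.r + self.c - self.rt)
        self.r = self.rt
        return (self.r, self.c)

    def receive(self, msg):
        r_m, c_m = msg
        if r_m + 2 * EPS < self.rt and self.r + 2 * EPS < self.rt:
            self.c = 0
        else:
            self.c = max(0, self.r + self.c - self.rt,
                         r_m + c_m - self.rt + 1)
        self.r = self.rt
        return (self.r, self.c)


def run(rounds):
    """Repeat the cycle j -> k -> l -> j and return (process, timestamp) pairs."""
    j, k, l = Proc(rt=EPS), Proc(), Proc()
    trace = []
    for _ in range(rounds):
        m = j.send()
        trace.append(("j", m))
        k.rt += 1
        trace.append(("k", k.receive(m)))
        k.rt += 1
        m = k.send()
        trace.append(("k", m))
        l.rt += 1
        trace.append(("l", l.receive(m)))
        l.rt += 1
        m = l.send()
        trace.append(("l", m))
        j.rt += 1
        trace.append(("j", j.receive(m)))
        j.rt += 1
    return trace


def max_c(rounds):
    """Largest c value seen in the first `rounds` cycles."""
    return max(c for _, (_, c) in run(rounds))
```

With EPS = 5 the first cycle yields the timestamps (5, 0), (1, 5), (2, 4), (1, 6), (2, 5) and (6, 2), matching the values in the text, and each further cycle raises the c value of the receive event at l by one, so max_c(rounds) grows without bound while the clocks stay within EPS of each other.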



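[Figure: space-time diagram of processes j, k and l with event timestamps. 
At j: (ε, 0), (ε + 1, 2), (ε + 2, 1). At k: (1, ε), (2, ε − 1), (3, ε + 1). 
At l: (1, ε + 1), (3, ε + 2).] 
Fig. 4. Eliminating kn.j at the cost of unbounded c value for Program in Figure 3 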



5.2 Bounding c Value Using Application Guarantee 

In the logical timestamp program in Figure 3 the c value is unbounded. 
However, if an application still requires a bounded solution, it can obtain one 
by satisfying a simple guarantee about creation of events. More specifically, if 
the application periodically provides a window of size 2ε such that no events are 
created on any process in that window then c.j can be bounded. The application 
can easily satisfy this guarantee if each process ensures that no events are 
created when the physical clock of that process is in the interval 
[ax .. ax + 2ε] where x is the period and a is a natural number. In this case, the 
bound on c will be proportional to the period x. 

6 Discussion 

In this section, we discuss the role that our assumptions played in our program, 
their fine-tuning and their generality. We also discuss other interpretations of 
G1 and G2. 

Role of G1 and G2. In presenting our programs for logical timestamps and 
causal deterministic merge, we expected two guarantees from the underlying 
systems. Both these guarantees were necessary to obtain bounded stabilizing 
solutions. In this sense, the guarantees our solution expects are minimal. It is 
obvious that if only guarantee G1 were available, we could not derive a bounded 
stabilizing solution; a message that is delayed for a long enough time, so that a 
message with a similar timestamp can be generated in the meanwhile, can vio- 
late the requirements of logical timestamps (and causal deterministic merge). If 
only guarantee G2 were available, it would not be possible to bound the number 
of elements in kn; we used G1 to determine the number of elements that are 
maintained in kn. 

Fine-Tuning G1 and G2. In a given system, one needs to determine the 
values of ε and δ. It may also be possible to fine-tune these values depending 
upon other application requirements. The value of ε depends upon the closeness 
of clock values and their precision. The closeness of clock values will depend 
upon the clock synchronization algorithm used to correct them. If we provide 
higher priority for the process that corrects the clock on each processor, provide 
higher priority for messages sent by the clock synchronization algorithm, and 
reduce the non-determinism involved in the clock synchronization algorithm, we 
can reduce the value of ε (and, hence, buffer requirements, recovery time, etc.). 
The value of ε can also be reduced by increasing the minimum time difference 
between events. Regarding δ, it is easy to see that there is a tradeoff between 
the value of δ and the percentage of messages lost. For example, we can use 
techniques such as forward-error-correction and send parity messages to reduce 
the message loss. However, in that case, the value of δ will be high as we need 
to wait for these parity messages to arrive at the destination. 

Generality of G1 and G2. The guarantees expected in our model are sat- 
isfied by most existing systems. These guarantees are, however, stronger than 
that in |B| . Specifically, G2 can be easily obtained using the fail-aware datagram 
service. However, to the best of our knowledge, it is not known whether one can 
implement G1 and ensure that in the presence of faults, G1 is established in that 
model. 

Other Interpretations of G1 and G2. In our model, the value of rt.j may 
not be related to the auxiliary global time. We have permitted this explicitly 
to allow for the case where rt.j denotes some other progress measure for the 
program. For example, in 0, rt.j could denote the number of reset operations 
that j has performed. Or, in a checkpointing and recovery program, rt.j could 
denote the incarnation number of j. (Of course, in this case, we will permit 
K events between increments of rt.j, where K is a predetermined constant.) 
Thus, if a program provides the two guarantees G1 and G2 with respect to the 
new interpretation of rt.j, our solutions can also be applied. 

7 Conclusion and Future Work 

In this paper, we presented (in Section 4) a bounded state, stabilizing solution 
for causal deterministic merge. Our solution ensured that even if faults such 
as message corruption, improper initialization, temporary violation of system 
guarantees or transients occur, eventually the causal deterministic merge is 
provided. In our algorithm, the recovery was quick and proportional to the system 
guarantees. Our solution used only O(ε log n + log δ) space. 

To develop a solution to causal deterministic merge, we presented (in Section 
3) a self-stabilizing solution for logical timestamps. In our solution, the space 
cost to implement logical timestamps and the time required to recover from 
faults is proportional to the guarantees provided by the system. 

We also presented (in Section 5) variations of the above solution where we 
showed that the space required to implement logical timestamps can be made 
independent of the number of processes if the application satisfies a simple 
condition on how events are created. For a particular 'less' relation, in Section 
5.1 we showed that it is impossible to bound the size of logical timestamps 
without such a condition. 

In developing the above solutions, we expected the underlying system to 
make two guarantees G1 and G2. We argued (in Section 6) that these guarantees 
are reasonable in that existing distributed systems satisfy them. We also 
pointed out how these guarantees can be fine-tuned in a given system. 

We showed that given a system that satisfies G1 and G2, the system obtained 
after considering the buffering introduced by our causal delivery program also 
satisfies G1 and G2 (with slightly different parameters). We expect this property 
to be useful while building a hierarchical system. At the top level of this system, 
we will have subnetworks that consist of producers and mergers (which act as 
subscribers in this subnetwork). In turn, the mergers act as producers in another 
subnetwork and so on. In this case, the values of ε and δ may be different in 
different subnetworks. 

Our solution also improves the previous Δ-causal order solutions in [2, 3]. 
Specifically, the solution in [2] assumes the existence of a global clock, and 
requires O(n²) space that grows unbounded as the computation progresses. The 
solution in [3] allows the clock values to differ. However, it also requires O(n²) 
unbounded space, and can miss some causal dependencies. By contrast, we do 
not assume the existence of a global clock and use only O(log n) bounded space 
for our timestamps. In [2], a message received within time Δ will be delivered 
within time Δ. If we assumed the existence of global clocks (ε = 0) then our 
solution would also provide the exact same guarantee (in addition to stabilizing 
fault-tolerance and bounded logarithmic space). 

Regarding work on deterministic merge, Aguilera and Strom have pre- 
sented a deterministic merge program. In their program, the authors assume 
that the expected message rate of all producers is known and that the producers 
send dummy messages if they do not produce the data at the given rate. Also, if 
the producers produce messages at a faster rate than expected then the message 
delay grows unbounded. And, the order in which the messages are delivered in 
their program is not related to the causal order between them. By contrast, we 
provide causal delivery, do not assume the knowledge about the rate of produc- 
tion of messages, and limit the amount of time for which a message needs to be 
buffered. However, unlike the program in [Q, our program requires that G1 and 
G2 are satisfied. 

Our causal delivery algorithm is useful in multimedia real-time applications 
and group-ware real-time applications. In these applications, buffering is limited 
and the data is valuable only if it is received within some limited time. Our 
protocol allows precomputation of buffering requirements and expected delays. 
Moreover, as discussed in Section 6, we can fine-tune the suitable values for ε 
and δ based upon available buffers and maximum permitted delay. It follows 
that it would be possible to exploit system level guarantees to further provide 
guarantees about the flow of the multimedia real-time data. 

Our work suggests several directions for future work. Regarding logical times- 
tamps and causal deterministic merge, future work includes identifying the lower 
bound to solve these problems with G1 and G2. In |II3> we have presented a 
causal delivery program whose complexity is O(n log(ε + δ)) whereas the 
complexity of the program presented in Section 4 is O(ε log n + log δ). It would 
be interesting to determine if we can reduce the complexity further so that it is 
logarithmic in both ε and n. 

Abstract. We present a simple deterministic distributed depth-first token 
circulation (DFTC) protocol for arbitrary rooted networks. This protocol 
does not require processors to have identifiers, but assumes the 
existence of a distinguished processor, called the root of the network. 
The protocol is self-stabilizing, meaning that starting from an arbitrary 
state (in response to an arbitrary perturbation modifying the memory 
state), it is guaranteed to converge to the intended behavior in finite 
time. The proposed protocol stabilizes in O(n) time units, i.e., no more 
than the time for the token to visit all the processors (in the depth-first 
search order). It compares very favorably with all previously published 
DFTC algorithms for arbitrary rooted networks: they all stabilize in 
O(n × D) time, where D is the diameter of the network. 



1 Introduction 

Modern distributed systems have the inherent problem of faults. The quality of 
a distributed system design depends on its tolerance to faults that may occur 
at various components of the system. Many fault tolerant schemes have been 
proposed and implemented, but the most general technique to design a system 
that tolerates arbitrary transient faults is self-stabilization [?]. A self-stabilizing
protocol guarantees that, starting from an arbitrary initial state, the system 
converges to a desirable state in a finite time. 

The depth-first token circulation problem (DFTC) is to implement a token 
circulation scheme where the token is passed from one processor to another in 
the depth-first order such that every processor gets the token at least once in 
every token circulation cycle. This scheme has many applications in distributed 
systems. The solution to this problem can be used to solve the mutual exclu- 
sion, spanning tree construction, synchronization, finding the biconnected com- 
ponents, and many other important tasks. 

Related Work. Dijkstra introduced the property of self-stabilization in
distributed systems by applying it to algorithms for mutual exclusion on a
ring [?]. Several other deterministic self-stabilizing token passing algorithms
for rooted ring networks and linear arrays of processors have been proposed in
the literature, e.g., [?]. Dolev, Israeli, and Moran [?] gave a self-stabilizing
mutual exclusion protocol by circulating a token in the depth-first search
order on a tree network under a model whose actions only allow read/write
atomicity. Deterministic self-stabilizing DFTC algorithms for tree networks can
be found in [?]. In [?], the authors provide a token circulation scheme
for arbitrary networks by constructing a spanning tree and implementing the
depth-first token circulation scheme on the constructed spanning tree. There
exist many self-stabilizing spanning tree construction algorithms in the
literature, e.g., [?]. The algorithms proposed in [?] can be combined with
any spanning tree construction to provide a token circulation scheme for
arbitrary networks. Note that the resulting protocol is not necessarily a
depth-first token circulation for general networks; it depends on the structure
of the constructed tree on which the DFTC works.

Self-stabilizing depth-first token circulation for arbitrary rooted networks
(without pre-computing a rooted spanning tree) was first considered by Huang
and Chen [?]. The solution of [?] needs processors having $O(n)$ states
($O(\log n)$ bits). Subsequently, several protocols to solve the self-stabilizing
DFTC were proposed [?]. All these papers attempted to reduce the space
complexity to $O(\Delta)$, where $\Delta$ is the degree of the network. The algorithm
presented in [?] offers the best space complexity.

All self-stabilizing DFTC algorithms for general networks in the current
literature [?] use two colors to distinguish two consecutive token circulation
cycles. To make sure that each cycle reaches all the processors of the
network, each cycle must start from a configuration where all the processors
are uniformly colored. For all the above solutions, in any configuration where
an error exists, it is required (i) to correct the abnormal token circulations
(token circulations not initiated by the root), and (ii) to correct processors
abnormally colored. The first case (Case (i)) is solved in $O(n)$ time units in [?]
and in $O(n \times D)$ time units in [?], where D is the diameter of the network.
But all the solutions in [?] need $O(n \times D)$ time units to correct processors
abnormally colored (Case (ii)). So, the stabilization time for all the existing
algorithms is $O(n \times D)$.

Contributions. In this paper, our goal is to reduce the stabilization time
complexity of the DFTC scheme. We present a self-stabilizing DFTC algorithm
(called Algorithm $\mathcal{DFTC}$) for general networks with a distinguished root.
Algorithm $\mathcal{DFTC}$ needs processors having $O(n)$ states, but it requires no
colors to distinguish the token circulation cycles. Our solution stabilizes in
only $O(n)$ time units. Note that our stabilization time complexity and the time
for the token to complete one DFTC cycle are of the same order.

The rest of the paper is organized as follows: In Section 2, we describe the
distributed systems and the model in which our token circulation scheme is
written, and give a formal statement of the token passing problem solved in
this paper. In Section 3, we present the token passing protocol, and in the
following section (Section 4), we give the proof of stabilization of the
protocol. In Section 5, the space and the stabilization time complexity of the
protocol are given. Finally, we make concluding remarks in Section 6.



2 Preliminaries

In this section, we define the distributed systems and programs considered in 
this paper, and state what it means for a protocol to be self-stabilizing. We then 
present the statement of the DFTC problem and its properties. 



2.1 Self-Stabilizing System

A distributed system is an undirected connected graph, S = (V, E), where V is
a set of processors (|V| = n) and E is the set of bidirectional communication
links. We consider networks which are asynchronous and rooted, i.e., all
processors, except the root, are anonymous. We denote the root processor by R. A
communication link (p, q) exists iff p and q are neighbors. Every processor p can
distinguish all its links. To simplify the presentation, we refer to a link (p, q) of
processor p simply by the label q. We assume that the labels, stored in the set
$N_p$, are arranged in some arbitrary order $\succ_p$
($\forall q_1, q_2 \in N_p :: (q_1 \succ_p q_2) \wedge (q_2 \succ_p q_1) \Rightarrow (q_1 = q_2)$).
We assume that $N_p$ is a constant and is maintained by an underlying protocol.

Each processor executes the same program except R. The program consists 
of a set of shared variables (henceforth referred to as variables) and a finite set 
of actions. A processor can only write to its own variables and can only read 
its own variables and variables owned by the neighboring processors. So, the 
variables of p can be accessed by p and its neighbors. 

Each action is uniquely identified by a label and is of the following form:

<label> :: <guard> $\rightarrow$ <statement>.

The guard of an action in the
program of p is a boolean expression involving the variables of p and its neigh- 
bors. The statement of an action of p updates variables of p. An action can be 
executed only if its guard evaluates to true. We assume that the actions are 
atomically executed: the evaluation of a guard and the execution of the corre- 
sponding statement of an action, if executed, are done in one atomic step. The 
atomic execution of an action of p is called a step of p. 
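
The guarded-action model described above can be illustrated with a minimal sketch. It is not the paper's notation: the `Action` class, the `step` helper, the variable name `x`, and the toy action `catch_up` are hypothetical names introduced only for this illustration. A guard reads the processor's own variables and those of its neighbors; a statement writes only the processor's own variables; guard evaluation and statement execution happen in one call, mimicking an atomic step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

State = Dict[str, int]  # hypothetical: the variables of one processor


@dataclass
class Action:
    label: str
    guard: Callable[[State, List[State]], bool]      # reads own and neighbors' variables
    statement: Callable[[State, List[State]], None]  # writes own variables only


def step(actions: List[Action], own: State, neighbors: List[State]) -> Optional[str]:
    """One atomic step: evaluate guards and run the first enabled statement."""
    for action in actions:
        if action.guard(own, neighbors):
            action.statement(own, neighbors)
            return action.label
    return None  # no guard holds: the processor is not enabled


# Toy action: adopt the largest neighbor value when it exceeds our own.
catch_up = Action(
    label="U",
    guard=lambda own, nbrs: any(q["x"] > own["x"] for q in nbrs),
    statement=lambda own, nbrs: own.update(x=max(q["x"] for q in nbrs)),
)
```

Calling `step([catch_up], own, neighbors)` returns the label of the executed action, or `None` when the processor is not enabled.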

The state of a processor is defined by the values of its variables. The state
of a system is a product of the states of all processors ($\in V$). In the sequel, we
refer to the state of a processor and system as a (local) state and configuration,
respectively. Let a distributed protocol $\mathcal{P}$ be a collection of binary transition
relations, denoted by $\mapsto$, on $\mathcal{C}$, the set of all possible configurations of the
system. A computation of a protocol $\mathcal{P}$ is a maximal sequence of configurations
$e = (\gamma_0, \gamma_1, \ldots, \gamma_i, \gamma_{i+1}, \ldots)$, such that for $i \geq 0$, $\gamma_i \mapsto \gamma_{i+1}$ (a single
computation step) if $\gamma_{i+1}$ exists, or $\gamma_i$ is a terminal configuration. Maximality
means that the sequence is either infinite, or it is finite and no action of $\mathcal{P}$ is
enabled in the final configuration. All computations considered in this paper are
assumed to be maximal. The set of computations of a protocol $\mathcal{P}$ in system S
starting with a particular configuration $\alpha \in \mathcal{C}$ is denoted by $\mathcal{E}_\alpha$. The set of
all possible computations of $\mathcal{P}$ in system S is denoted as $\mathcal{E}$.

A processor p is said to be enabled in $\gamma$ ($\gamma \in \mathcal{C}$) if there exists at least one
action A such that the guard of A is true in $\gamma$. When there is no ambiguity,
we will omit $\gamma$. We consider that any processor p executes a disable action in
the computation step $\gamma_i \mapsto \gamma_{i+1}$ if p is enabled in $\gamma_i$ and not enabled in
$\gamma_{i+1}$, but does not execute any action between these two configurations. (The
disable action represents the following situation: at least one neighbor of p
changes its state between $\gamma_i$ and $\gamma_{i+1}$, and this change makes the guards of
all actions of p false.) Similarly, an action A is said to be enabled (in $\gamma$) at p
if the guard of A is true at p (in $\gamma$). We assume a weakly fair and distributed daemon.
The weak fairness means that if a processor p is continuously enabled, then p is 
eventually chosen by the daemon to execute an action. The distributed daemon 
implies that during a computation step, if one or more processors are enabled, 
then the daemon chooses at least one (possibly more) of these enabled processors 
to execute an action. 

In order to compute the time complexity, we use the definition of round [?].
This definition captures the execution rate of the slowest processor in any
computation. Given a computation e ($e \in \mathcal{E}$), the first round of e (let us call
it e') is the minimal prefix of e containing the execution of one action (an
action of the protocol or the disable action) of every continuously enabled
processor from the first configuration. Let e'' be the suffix of e such that
e = e'e''. Then the second round of e is the first round of e'', and so on.

Let X be a set. $x \vdash P$ means that an element $x \in X$ satisfies the predicate
P defined on the set X. A predicate is non-empty if there exists at least one
element that satisfies the predicate. We define a special predicate true as follows:
for any $x \in X$, $x \vdash true$.

We use the following term, attractor, in the definition of self-stabilization.

Definition 1 (Attractor). Let X and Y be two predicates of a protocol $\mathcal{P}$
defined on $\mathcal{C}$ of system S. Y is an attractor for X if and only if the following
condition is true:
$\forall \alpha \vdash X : \forall e \in \mathcal{E}_\alpha : e = (\gamma_0, \gamma_1, \ldots) :: \exists i \geq 0, \forall j \geq i, \gamma_j \vdash Y$.

We denote this relation as $X \triangleright Y$.

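Definition 2 (Self-Stabilization). The protocol $\mathcal{P}$ is self-stabilizing for the
specification $\mathcal{SP}_{\mathcal{P}}$ on $\mathcal{E}$ if and only if there exists a predicate
$\mathcal{L}_{\mathcal{P}}$ (called the legitimacy predicate) defined on $\mathcal{C}$ such that the
following conditions hold:

1. $\forall \alpha \vdash \mathcal{L}_{\mathcal{P}} : \forall e \in \mathcal{E}_\alpha :: e \vdash \mathcal{SP}_{\mathcal{P}}$ (correctness),

2. $true \triangleright \mathcal{L}_{\mathcal{P}}$ (closure and convergence).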
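
The attractor condition of Definition 1 quantifies over all (possibly infinite) computations, so it cannot be verified by running code. On a finite recorded prefix of one computation, however, one can at least find the earliest index from which a predicate holds until the end of the prefix. The sketch below does only that; the helper name `stabilization_index` and the representation of configurations as arbitrary Python values are assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, TypeVar

Config = TypeVar("Config")


def stabilization_index(trace: Sequence[Config],
                        y: Callable[[Config], bool]) -> Optional[int]:
    """Smallest i such that y holds in trace[j] for every j >= i, or None."""
    index = None
    for j in range(len(trace) - 1, -1, -1):  # scan backwards from the last configuration
        if not y(trace[j]):
            break
        index = j
    return index
```

A result of `None` means the predicate does not hold in the last recorded configuration, so the prefix gives no evidence of convergence. A non-`None` result is evidence only, not a proof that the predicate is an attractor.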

2.2 Specification of the Depth-First Token Circulation Protocol

Definition 3 (DFTC-cycle). A finite computation $e \in \mathcal{E}$ is called a DFTC-cycle
if the following conditions are true:

[S] Exactly one processor holds a token in any configuration;

[L1] The root initiates the DFTC-cycle by sending out exactly one token;

[L2] When a processor p receives a token, p sends the token to a processor
following the depth-first search order.

We consider a computation e of $\mathcal{DFTC}$ to satisfy $\mathcal{SP}_{DFTC}$ iff e is an infinite
repetition of depth-first circulation cycles (as in Definition 3).



3 Depth-First Token Circulation Algorithm

We borrow the following term from [?]: The first DFS tree of the graph G is
defined as the DFS spanning tree rooted at R, created by traversing the graph
in the DFS manner and visiting the adjacent edges of every processor p in the
order induced by $\succ_p$. The depth-first token circulation proposed in this section is
designed in such a way that the token is passed among the processors following 
the first DFS tree. We first present the data structure used by the processors. We 
then explain the normal behavior of the algorithm, followed by the description 
of the error correction mechanism. 
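
As an illustration of the first DFS tree, the following sketch computes it centrally for a network given as ordered adjacency lists. This is not the distributed protocol itself; the function name `first_dfs_tree` and the use of list position to stand in for the local order are assumptions of this example.

```python
from typing import Dict, List, Optional


def first_dfs_tree(neighbors: Dict[str, List[str]], root: str) -> Dict[str, Optional[str]]:
    """Map each reachable processor to its parent in the first DFS tree."""
    parent: Dict[str, Optional[str]] = {root: None}

    def visit(p: str) -> None:
        for q in neighbors[p]:  # list order plays the role of p's local order
            if q not in parent:
                parent[q] = p
                visit(q)

    visit(root)
    return parent
```

Because Python dictionaries preserve insertion order, iterating over the returned mapping also yields the processors in the order in which the traversal first reaches them.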



3.1 Data Structures 

The self-stabilizing depth-first token circulation algorithm (Algorithm $\mathcal{DFTC}$)
is shown in Algorithm 1 for the root (R) and in Algorithm 2 for the other
processors. The macros are not variables and are dynamically evaluated. The
predicates are used to describe the guards of the actions in Algorithm $\mathcal{DFTC}$.



Algorithm 1 ($\mathcal{DFTC}$) For Processor R (p = R)

Constants: $N_p$ : set of (locally ordered) neighbors; $L_p = 0$;

Variables: $S_p \in N_p \cup \{done\}$;

Macros:
  $Next_p = (q = \min_{\succ_p} \{q' \in N_p :: (q' \succ_p S_p) \wedge (S_{q'} = idle)\})$ if q exists,
            done otherwise;

Predicates:
  $Forward(p) \equiv (S_p = done)$
  $Backward(p) \equiv (S_p = q :: q \in N_p \wedge S_q = done)$
  $Locked(p) \equiv (\exists q \in N_p :: S_q \neq idle)$

Actions:
  F :: $Forward(p) \wedge \neg Locked(p) \rightarrow S_p := Next_p$;
  B :: $Backward(p) \rightarrow S_p := Next_p$;


Each processor p maintains two variables, $S_p$ and $L_p$. If p = R, then
$S_R \in N_R \cup \{done\}$. Otherwise ($p \neq R$), $S_p \in N_p \cup \{idle, done, wait\}$.
For every processor p, if there exists $q \in N_p$ such that $S_p = q$, then p
(resp., q) is said to be a predecessor of q (resp., the successor of p). In other
words, when $S_p \notin \{done, idle, wait\}$, $S_p$ plays the role of a pointer pointing
to the neighbor to which p sent the token. Variable $L_p$ contains the length of
the path followed by the token from the root to p. Since the length from the
root to itself is 0, $L_R$ must be equal to zero, and hence is shown as a constant
in Algorithm 1.



Algorithm 2 ($\mathcal{DFTC}$) For other processors ($p \neq R$)

Constants: $N_p$ : set of (locally ordered) neighbors; $L_{max} = n - 1$;

Variables: $S_p \in N_p \cup \{idle, done, wait\}$; $L_p \in \{1, \ldots, L_{max}\}$;

Macros:
  $Next_p = (q = \min_{\succ_p} \{q' \in N_p :: (q' \succ_p S_p) \wedge (S_{q'} = idle)\})$ if q exists,
            done otherwise;
  $Pred_p = \{q \in N_p :: S_q = p\}$;
  $RealPred_p = \{q \in Pred_p :: L_q = L_p - 1\}$;

Predicates:
  $ForwardI(p) \equiv (|Pred_p| = 1) \wedge (S_p = idle)$
  $ForwardW(p) \equiv (|Pred_p| = 1) \wedge (S_p = wait)$
  $Forward(p) \equiv ForwardI(p) \vee ForwardW(p)$
  $Backward(p) \equiv (|RealPred_p| = 1) \wedge (S_p = q :: q \in N_p \wedge S_q = done)$
  $Locked(p) \equiv (\exists q \in N_p :: S_q = done) \vee (\exists q \in Pred_p :: L_q \geq L_{max})$
  $Clean(p) \equiv (S_p = done) \wedge ((\forall q \in N_p :: S_q \in \{idle, done\}) \vee (\exists q \in N_p :: S_q = wait))$
  $CleanW(p) \equiv (S_p = wait) \wedge (|Pred_p| = 0)$
  $EDetect(p) \equiv (S_p = q :: q \in N_p \wedge ((L_q = 0) \vee (|RealPred_p| = 0)))$

Actions:
  F     :: $Forward(p) \wedge \neg Locked(p) \rightarrow S_p := Next_p$, $L_p := L_q + 1$ ($q \in Pred_p$);
  B     :: $Backward(p) \rightarrow S_p := Next_p$;
  W     :: $ForwardI(p) \wedge Locked(p) \rightarrow S_p := wait$;
  C     :: $Clean(p) \rightarrow S_p := idle$;
  Error :: $EDetect(p) \vee CleanW(p) \rightarrow S_p := idle$;



3.2 Normal Behavior of Algorithm $\mathcal{DFTC}$

In this subsection, we first ignore the parts of the algorithm where Variable $L_p$
is involved (e.g., Macro $RealPred_p$, the right part of Predicate Locked, etc.),
because Variable $L_p$ is used to handle the correction of abnormal situations
only. Once stabilized, the system must contain only one token which circulates
following the first DFS tree. In such a configuration, a processor can make a
move only if it holds the token. Holding the token means either Forward(p)
or Backward(p) is true. Formally, $Token(p) \equiv Forward(p) \vee Backward(p)$. A
processor p ($p \neq R$) is said to be idle ($S_p = idle$) when p is ready to participate
in the next DFTC cycle. Similarly, a processor p is in state done ($S_p = done$) when
p has completed the current DFTC cycle. Let us omit the state wait ($S_p = wait$)
for a while.

Macro $Pred_p$ (Algorithm 2) is the subset of the neighbors of p which took p
as successor. When Algorithm $\mathcal{DFTC}$ behaves according to its specification,
$|Pred_p| = 1$. Macro $Next_p$ is used to choose the next successor of p.

Let us explain the normal behavior of Algorithm $\mathcal{DFTC}$ following the
example shown in Figure 1. For a better presentation, in the example, we assume
that the daemon chooses all enabled processors to execute an action (i.e., the
daemon behaves as a synchronous daemon). In Configuration (i), only Action F
is enabled at R. This means that the root is ready to start a new DFTC cycle.
The root chooses Processor a as the successor (Macro Next). This is shown
in Configuration (ii). Similarly, Processor a chooses a successor (Configuration
(iii)) by executing Action F. This process of extending the path continues until
c executes Action F. Processor c does not have any neighbor to choose from.
So, c executes $S_c := done$ (see Macro Next). This indicates to its predecessor
b that the token has traversed all processors reachable from c in the first DFS
tree (Configuration (v)). Now, Backward(b) becomes true and b can execute
Action B. Since b has no more unvisited neighbors, $S_b$ becomes equal to done
(Configuration (vi)). Next, Backward(a) also becomes true, leading a to execute
$S_a := done$ (Configuration (vii)). Note that when Processor a is done, Processors
b and c can clean their state by changing their state from done to idle.
So, Actions F and C (or B and C) can run concurrently. Actions F and B are
repeated until all processors are visited by the token (Configurations (viii) to
(x)). The configuration following Configuration (x) is Configuration (i), reached
when Processors a and d clean their state. Then, R starts a new DFTC cycle.

Fig. 1. An example showing the normal behavior of Algorithm $\mathcal{DFTC}$.
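
The normal behavior described above can be replayed with a small simulation. The sketch below is one reading of Algorithms 1 and 2 under a synchronous daemon, not a reference implementation, and it rests on several assumptions: list position stands in for the local order; the special states idle, done and wait are treated as preceding every neighbor in that order; Locked for non-root processors uses $L_q \geq L_{max}$; when several guards hold at once, Action Error is tried first; and the five-processor network (a chain R, a, b, c plus a processor d attached to R) is a hypothetical topology chosen for this example. The names `next_of`, `move`, `run_cycle` and `NETWORK` are likewise introduced here.

```python
IDLE, DONE, WAIT = "idle", "done", "wait"
SPECIAL = {IDLE, DONE, WAIT}


def next_of(p, N, S):
    """Macro Next_p: first idle neighbor after S_p in p's local order, else DONE."""
    # Assumption: idle/done/wait precede every neighbor in the local order.
    start = 0 if S[p] in SPECIAL else N[p].index(S[p]) + 1
    return next((q for q in N[p][start:] if S[q] == IDLE), DONE)


def move(p, root, N, S, L, lmax):
    """Return (label, new S_p, new L_p) for p's enabled action, or None."""
    s = S[p]
    points = s not in SPECIAL  # S_p is a pointer to a successor
    if p == root:
        locked = any(S[q] != IDLE for q in N[p])
        if s == DONE and not locked:
            return "F", next_of(p, N, S), 0
        if points and S[s] == DONE:
            return "B", next_of(p, N, S), 0
        return None
    pred = [q for q in N[p] if S[q] == p]
    real = [q for q in pred if L[q] == L[p] - 1]
    nbr_states = [S[q] for q in N[p]]
    locked = DONE in nbr_states or any(L[q] >= lmax for q in pred)
    if (points and (L[s] == 0 or not real)) or (s == WAIT and not pred):
        return "Error", IDLE, L[p]
    if len(pred) == 1 and s in (IDLE, WAIT) and not locked:
        return "F", next_of(p, N, S), L[pred[0]] + 1
    if points and len(real) == 1 and S[s] == DONE:
        return "B", next_of(p, N, S), L[p]
    if len(pred) == 1 and s == IDLE and locked:
        return "W", WAIT, L[p]
    if s == DONE and (all(x in (IDLE, DONE) for x in nbr_states) or WAIT in nbr_states):
        return "C", IDLE, L[p]
    return None


def run_cycle(N, root, max_steps=1000):
    """Run one DFTC cycle under a synchronous daemon; return (visit order, steps)."""
    S = {p: IDLE for p in N}
    S[root] = DONE
    L = {p: 1 for p in N}
    L[root] = 0
    start, visits = dict(S), []
    for steps in range(1, max_steps + 1):
        # All guards are evaluated on the old configuration before any write.
        moves = {p: move(p, root, N, S, L, len(N) - 1) for p in N}
        for p, m in moves.items():
            if m is not None:
                label, S[p], L[p] = m
                if label == "F":
                    visits.append(p)  # p has just received the token
        if S == start:
            return visits, steps
    raise RuntimeError("no cycle completed within max_steps")


# Hypothetical network: chain R-a-b-c plus d attached to R.
NETWORK = {"R": ["a", "d"], "a": ["R", "b"], "b": ["a", "c"], "c": ["b"], "d": ["R"]}
order, steps = run_cycle(NETWORK, "R")
```

On this network the token visits R, a, b, c, d in that order, and the starting configuration is reached again after ten synchronous steps, the last of which lets a and d clean their state.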

As we mentioned before, the example shown in Figure 1 assumed a synchronous
daemon. But our algorithm works in an asynchronous environment as well. Due to
asynchrony in the network, some processors may be involved in a DFTC cycle
whereas the others are still cleaning their state following the previous DFTC
cycle. We need to make sure that these two cycles do not confuse each other. We
solve this problem by using the cleaner [?]. The cleaner is a tool adding two
safeguards to the algorithm: (i) A processor p is allowed to change its state
from done to idle (Action C) only when all its neighbors are done or idle
(Predicate Clean(p)). (ii) A processor p such that Forward(p) is true can select
its successor only when each of its neighbors which is not its predecessor is
idle (Locked(p) must be false). Thus, the processors which are slow to execute
their Action C are protected from the next DFTC cycle.

Finally, consider the configuration in which Forward(p) is true on processor
p and p is waiting for some of its neighbors which are slow to execute Action C.
Note that in such a configuration, p can execute Action W (both ForwardI(p)
and Locked(p) are true). By executing Action W, p changes its state $S_p$ from
idle to wait. This is called the wait mechanism. The wait mechanism forces
each neighbor q of p which is slow to change its state from done to idle,
whatever the state of the other neighbors of q (see Predicate Clean()). This does
not slow down the progress of the token because it is executed concurrently with
the cleaning phase (of slow processors). However, it is clear that this mechanism
is not necessary to make sure that Algorithm $\mathcal{DFTC}$ behaves correctly
(according to its specification). We show in the next subsection (Subsection 3.3)
that the wait mechanism ensures the liveness of Algorithm $\mathcal{DFTC}$ starting from
an abnormal configuration.

3.3 Error Correction 

We now consider the abnormal situations due to the unpredictable initial
configurations and transient errors. An illegitimate configuration is shown in
Figure 2.

Informally, a configuration is an "abnormal" configuration if one of the
following situations occurs:

[AC1] Some neighbors of the root have the root as a successor (see Processor d
in Figure 2);

[AC2] Some processors are involved in some successor chains not rooted in R
(in Figure 2, all processors other than the root and h, i, n, and o are in
such a configuration);

[AC3] Some processors are in the state wait without a predecessor (Processor h
in Figure 2 is in this configuration).




Fig. 2. An example showing an abnormal configuration. 



The system may contain chains of successors forming cycles (possibly including
R) or “alive” chains, i.e., chains of successors not rooted in R, which cause
several tokens to circulate in the network. In order to break the cycles, we use
the length of the path followed by the token to p, i.e., Variable $L_p$ [?]. When
$S_p \in \{idle, done, wait\}$, the value of Variable $L_p$ is ignored. Any processor p
sets its variable $L_p$ to the level of its predecessor plus one by executing
Action F ($L_R$ is fixed to 0). (In Figure 2, Variable $L_p$ is shown for processors
having a successor.) Thus, if some processors are involved in a cycle, at least
one of them, say p, must have a level that does not match that of its
predecessor q in the cycle, i.e., $L_p \neq L_q + 1$. In Figure 2, Processors a, b,
and j are such processors. Every predecessor q of p such that $L_q = L_p - 1$ is
called a real predecessor of p (see Macro $RealPred_p$ of Algorithm 2). A processor
p without a real predecessor but having a successor is called a worm bottom.
The successor chain starting in p is called a worm.
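
The claim that every cycle of successors contains a level mismatch follows from a short counting argument. As a sketch, assume the cycle consists of processors $p_0, p_1, \ldots, p_{k-1}$ with $S_{p_i} = p_{(i+1) \bmod k}$ (this indexing is introduced only for the illustration). Summing the level differences around the cycle telescopes to zero:

```latex
\sum_{i=0}^{k-1} \left( L_{p_{(i+1) \bmod k}} - L_{p_i} \right) = 0 .
```

If every processor on the cycle had a level equal to its predecessor's level plus one, each term would be 1 and the sum would equal $k \geq 1$, a contradiction. Hence at least one processor on the cycle has no real predecessor within the cycle.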

Using $L_p$, a neighbor of the root can also detect that it is a predecessor of R
because the level of its successor is 0. Variable $L_p$ also makes it possible to
stop the progress of worms. A processor p ready to forward a token (Forward(p) is
true) will be allowed to do so only if the level value of its predecessor is lower
than $L_{max} = n - 1$ (see Predicate Locked(p) in Algorithm 2). Action Error of
Algorithm 2 is used to handle all the abnormal configurations described above.

So far, we did not explain the use of the wait mechanism in dealing with
errors. Consider the configuration where Forward(p) is true at a processor
$p \neq R$. Due to the unpredictable initial configuration, some neighbors of p can
be in the state done. An example of such a configuration is shown in Figure 2:
Processor o is waiting for Processor n to execute Action C, but Processor n is
waiting for all its neighbors (in particular R and Processor i) to be done or idle.
Note that the two features of the cleaner prevent this kind of configuration from
occurring during the normal token circulation (see Subsection 3.2). Now, assume
that the network contains no worm, and no processor has R as a successor. Then,
only one token exists in the network (Token(p) is true). The system seems to be
in a deadlock configuration. We introduce the wait mechanism to unlock such a
configuration. By executing Action W, the processor holding the token forces its
neighbors to change their state from done to idle (see Predicate Clean()). In
Figure 2, once Processor o has executed Action W, Action C is enabled at
Processor n, which eventually executes Action C by fairness. But the state wait
introduces the third abnormal situation we mentioned above (Case [AC3]): some
processors are in state wait with no predecessor. Predicate CleanW() and Action
Error handle the correction of this abnormal situation.



4 Stabilization Proof of Algorithm $\mathcal{DFTC}$

A DFTC-cycle starts in a configuration which satisfies some special properties
in terms of the state variable values. We now define an equivalence class of
configurations which satisfies these properties.

Let us define the set SC (starting configuration set) such that

$SC \equiv S_R = done \wedge (\forall p \in N_R :: S_p = idle) \wedge (\forall p \in V \setminus (\{R\} \cup N_R) :: S_p \in \{idle, done\})$

We define two configurations $\gamma$ and $\gamma'$ as equivalent with respect to the set SC
if the following condition is true: $\gamma \equiv_{SC} \gamma'$ iff $(\gamma \in SC) \wedge (\gamma' \in SC)$.

Definition 4. Let $\mathcal{L}_{DFTC}$ be the legitimacy predicate such that a configuration
$\gamma$ satisfies $\mathcal{L}_{DFTC}$ ($\gamma \vdash \mathcal{L}_{DFTC}$) if the following holds:
$\forall \gamma_0 \in SC, \exists e = \gamma_0, \gamma_1, \ldots \in \mathcal{E}_{\gamma_0}, \exists i \geq 0 :: \gamma = \gamma_i$.

We exhibit a finite sequence of state predicates $A_0, A_1, \ldots, A_m$ of Protocol
$\mathcal{DFTC}$ such that the following conditions hold: (i) $A_0 \equiv true$ (meaning any
arbitrary state); (ii) $\forall j : 0 \leq j < m :: A_j \triangleright A_{j+1}$; (iii) $A_m \Rightarrow \mathcal{L}_{DFTC}$.

Definition 5 (Successor Path). For any node p such that $S_p \in N_p$, the
successor path $\overline{p}$ is the unique path $p_1, p_2, \ldots, p_k$ such that (i) $p = p_1$,
(ii) $k \geq 2$, (iii) $\forall i, 1 \leq i \leq k - 1, S_{p_i} = p_{i+1}$, and
(iv) $S_{p_k} \notin \{idle, done, wait\} \Rightarrow S_{p_k} \in \{p_1, p_2, \ldots, p_{k-1}\}$.
$\forall i \geq 1$, $p_i$ is said to belong to $\overline{p}$ and is denoted as $p_i \in \overline{p}$. k
is called the length of $\overline{p}$.

Let us define the following predicates:

$LevelError(p) \equiv (S_p = q : q \in N_p \wedge (S_q \neq done \Rightarrow (S_q = q' : q' \in N_q \wedge L_q \neq L_p + 1)))$

$Top(p) \equiv (p = R \wedge (S_p \neq done \Rightarrow LevelError(p))) \vee (p \neq R \wedge ((S_p \in \{idle, wait\} \wedge Pred_p \neq \emptyset) \vee LevelError(p)))$

A processor p is called a top in Configuration $\gamma$ if Top(p) holds in $\gamma$. When
there is no ambiguity, we will omit $\gamma$. In Figure 2, Processors a, b, d, k, m, o,
and p are top processors.

Remark 1. Every successor path $\overline{p}$ contains at least one top processor.

Let us define another predicate:

$WBottom(p) \equiv (S_p \in N_p \wedge RealPred_p = \emptyset)$

A processor p is called a Worm Bottom (or simply a Bottom) in Configuration
$\gamma$ if WBottom(p) holds in $\gamma$. When there is no ambiguity, we will omit $\gamma$.
Note that if $Pred_p = \emptyset$, then WBottom(p) is true if $S_p \in N_p$. In Figure 2,
Processors c, d, g, j, l, and p are bottom processors.

Definition 6 (Worm). A successor path $\overline{p} = p_1, p_2, \ldots, p_k$ is a worm (path) if
and only if the following conditions are true:

(1) WBottom(p),

(2) $\forall i, 1 \leq i \leq k, Top(p_i) \Rightarrow i = k$.

Note that the length of a worm can be equal to 1. That is the case where the
state of the successor of the worm bottom is done. In Figure 2, the following
successor paths are worms: (c, b), (g, a), (d), (j, e, f, k), (l, m), and (p).
Processors (d) and (p) form worms of length 1.

Remark 2. Every worm $\overline{p} = p_1, p_2, \ldots, p_k$ with length greater than 1 (k > 1) is
an (elementary) path such that $\forall i, 1 \leq i \leq k - 1, L_{p_{i+1}} = L_{p_i} + 1$.

4.1 Root without a Predecessor

Let $Pred_R$ be the set of processors p such that $S_p = R$. We define
$A_1 \equiv (Pred_R = \emptyset)$. We now show that $A_1$ is an attractor.

Lemma 1. Every processor $p \in Pred_R$ remains in $Pred_R$ at most one round.

Proof. Except Action Error, no action is enabled on processors in $Pred_R$. Since
$L_R$ is a constant ($L_R = 0$), for every $p \in Pred_R$, EDetect(p) is true while p does
not execute Action Error. So, by fairness, every $p \in Pred_R$ eventually executes
Action Error. □

Theorem 1 ($A_0 \triangleright A_1$). In at most 1 round, $Pred_R = \emptyset$, and, once this happens,
$Pred_R = \emptyset$ forever.

Proof. Since $S_R$ cannot be equal to idle, R cannot be chosen as a successor by any
of its neighbors (see Macro $Next_p$). So, $|Pred_R|$ cannot increase. By Lemma 1,
in at most 1 round $Pred_R = \emptyset$. Once $Pred_R = \emptyset$, again because R cannot be
chosen as a successor, $Pred_R = \emptyset$ forever. □

4.2 Worm Destruction 

In the remainder, we assume that $A_1$ always holds. Let W be the number of
worms in the network. Define $A_2 \equiv A_1 \wedge (W = 0)$. We now show that $A_2$ is an
attractor, i.e., the network eventually contains no worm.

Lemma 2. Every worm bottom p remains a worm bottom until it executes
Action Error.

Proof. Let $\gamma_i \mapsto \gamma_{i+1}$ be a transition of an execution e such that (1) p is a
worm bottom in $\gamma_i$ and (2) p does not execute Action Error during $\gamma_i \mapsto \gamma_{i+1}$.
Since $S_p \neq idle$ in $\gamma_i$, no neighbor of p can choose p as successor during
$\gamma_i \mapsto \gamma_{i+1}$ (see Macro $Next_p$). So, $|Pred_p|$ in $\gamma_{i+1}$ is lower than or equal to
$|Pred_p|$ in $\gamma_i$. Moreover, no $q \in Pred_p$ can change its level value ($L_q$) during
$\gamma_i \mapsto \gamma_{i+1}$ (no rule for processors having a successor exists to do so). So,
$|RealPred_p|$ in $\gamma_{i+1}$ is lower than or equal to $|RealPred_p|$ in $\gamma_i$. So, p remains
a worm bottom in $\gamma_{i+1}$. By induction on each transition following
$\gamma_i \mapsto \gamma_{i+1}$ in e, p remains a worm bottom until it executes Action Error. □



Corollary 1. A worm bottom processor p can stay as a worm bottom at most 
one round. 



Lemma 3. If for any $\gamma_i \mapsto \gamma_{i+1}$, a processor p becomes a worm bottom with
level $L_p$ in $\gamma_{i+1}$, and p was not a worm bottom in $\gamma_i$, then in $\gamma_i$,
$\forall q \in Pred_p$, q is a worm bottom with level $L_q = L_p - 1$.

Proof. Assume by contradiction that there exists at least one $q \in Pred_p$ which is
not a worm bottom in $\gamma_i$. So, neither p nor q is a worm bottom in $\gamma_i$. Consider
the three following cases:

1. $S_p \in \{idle, wait\}$ in $\gamma_i$. Since p is a worm bottom in $\gamma_{i+1}$, p executes an
action during $\gamma_i \mapsto \gamma_{i+1}$ such that $S_p$ changes from $S_p \in \{idle, wait\}$ (in $\gamma_i$)
to q', $q' \in N_p$, in $\gamma_{i+1}$. The only possible action that can make this happen
is Action F. This implies that $S_q = p$ in $\gamma_i$, and q executes an action during
$\gamma_i \mapsto \gamma_{i+1}$ such that $S_q \neq p$ or $L_q \neq L_p - 1$ in $\gamma_{i+1}$. The only possible action
which q can execute to make this change is Action Error. So, q is a worm
bottom in $\gamma_i$, which contradicts the assumption.

2. $S_p = done$ in $\gamma_i$. Only Action C can be executed when $S_p = done$. But $Pred_p$
must be empty. So, p cannot become a worm bottom during $\gamma_i \mapsto \gamma_{i+1}$, which
contradicts that p is a worm bottom in $\gamma_{i+1}$.

3. $S_p = q'$, $q' \in N_p$ in $\gamma_i$. Then, q can execute Action Error only, because
$S_q \notin \{idle, wait, done\}$ prevents q from executing Actions F, W, C, and
$S_p \neq done$ prevents q from executing Action B. So, q is a worm bottom in
$\gamma_i$, which contradicts the assumption.

□ 



Corollary 2. For any $\gamma_i \mapsto \gamma_{i+1}$, if there exists no worm bottom in $\gamma_i$, then
there exists no worm bottom in $\gamma_{i+1}$.

From Remark 2, Corollary 1, and Lemma 3, the following lemma directly follows:

Lemma 4. The highest level of any worm bottom in the network increases by
at least one in one round.

From Corollary 2, Lemma 4, and the fact that the level of every processor
is bounded by $L_{max}$, the following theorem holds:

Theorem 2 ($A_1 \triangleright A_2$). In at most $L_{max}$ rounds, W = 0 (the network contains
no worm) and, once this happens, W = 0 (the network remains without worm)
forever.

Corollary 3. In any configuration $\gamma \vdash A_2$, there exists exactly one processor p
such that Token(p) is true.

4.3 Abnormal Waiting Processor Removal 

In the remainder, we assume that $A_2$ always holds. A processor p is said to be
an abnormal waiting processor if $S_p = wait$ and $|Pred_p| = 0$. Let AWS be the
set of abnormal waiting processors. Define $A_3 \equiv A_2 \wedge (|AWS| = 0)$.

Theorem 3 ($A_2 \triangleright A_3$). In at most 1 round, $AWS = \emptyset$ (no abnormal waiting
processor exists), and, once this happens, $AWS = \emptyset$ forever.

Proof. Since no action allows any predecessor of a waiting processor to change its
state, no waiting processor can become an abnormal waiting processor. (Recall
that once $A_2$ holds, there exists no worm.) So, |AWS| never increases. Moreover,
no action allows a processor to choose an abnormal waiting processor as
successor. So, by fairness, every abnormal waiting processor p remains an
abnormal waiting processor at most one round. Thus, in at most 1 round,
$AWS = \emptyset$. Once this happens, since |AWS| cannot increase, $AWS = \emptyset$ forever. □

4.4 Legitimacy Predicate 

In the remainder, we assume that $A_3$ always holds. At this point of the proof,
one can notice that even if the successor path of R is the only possible
successor path in the network, it may follow edges which are not edges of the
first DFS tree. Moreover, as long as R does not initiate a new DFTC cycle, the
token may not visit some parts of the network. So, we need to show that the
token moves among the processors infinitely often, and that the system eventually
reaches a configuration from which Algorithm $\mathcal{DFTC}$ behaves according to its
specification.

Lemma 5. Let $p$ ($p \neq R$) be a processor such that $Clean(p)$ is true at $\gamma_i \vdash A_3$.
Processor $p$ executes Action C in at most one round.

Proof. Starting from a configuration $\gamma_i$ ($\vdash A_3$) in which Processor $p \neq R$ is
done, $p$ can execute only Action C (the unique possible action in the state done).
So, starting from a configuration in which $Clean(p)$ is true, either $p$ eventually
executes Action C or $p$ never executes any action ($S_p = done$ forever). Assume
by contradiction that $p$ does not change $S_p$ to idle in at most one round. So,
starting from a configuration $\gamma_i$ ($\vdash A_3$) where $Clean(p)$ is true, $p$ never
executes Action C ($S_p = done$ forever), but $Clean(p)$ does not remain true forever
(otherwise, by fairness, $p$ eventually executes Action C, which contradicts the
assumption). Since $Clean(p)$ is true in $\gamma_i$, $S_p = done$ and either $\exists q \in N_p$ such
that $S_q = wait$ or $\forall q \in N_p$, $S_q \in \{idle, done\}$, at $\gamma_i$.

1. There exists $q \in N_p$ such that $S_q = wait$ at $\gamma_i$. Since $S_p = done$ at $\gamma_i$, $Locked(q)$ is
true at $\gamma_i$. Since by assumption $S_p = done$ forever, once in the state wait,
$Locked(q)$ is true forever. So, $q$ is never able to execute an action. Thus,
$Clean(p)$ remains true forever, which contradicts the assumption.

2. $\forall q \in N_p$, $S_q \in \{idle, done\}$ at $\gamma_i$. No neighbor $q$ in the state idle can execute
either Action F (because $Locked(q)$ is true) or Action W (this case would
lead to Case 1). So, every $q$ in the state idle remains idle forever. Any $q$ in
the state done can only change to idle (by executing Action C). So, either
every $q \in N_p$ is eventually idle forever or some $q \in N_p$ remains in the state
done forever. In both cases, $Clean(p)$ remains true forever, which contradicts
the assumption.

□ 



Lemma 6. Let $p$ be a processor such that $Forward(p)$ is true at $\gamma_i \vdash A_3$.
Processor $p$ executes Action F in at most two rounds.

Proof. There are three cases:

1. $p = R$ and $S_p = done$ at $\gamma_i$. From Lemma 5, $\forall q \in N_p$, $S_q = idle$ in at
most one round. Next, $Premission(p)$ is continuously true and $p$ eventually
executes Action F (in one round).

2. $p \neq R$ and $S_p = wait$ at $\gamma_i$. This case is similar to Case 1.

3. $p \neq R$ and $S_p = idle$ at $\gamma_i$. Again, there are two cases:

(a) $Locked(p)$ is false in $\gamma_i$. Then, $p$ eventually executes Action F (in one
round).

(b) $Locked(p)$ is true in $\gamma_i$. Then, from Lemma 5, $\forall q \in N_p$, $S_q = idle$ in
at most one round. Then, once $\forall q \in N_p$, $S_q = idle$, $Locked(p)$ is false
and either $S_p$ remains idle ($p$ does not execute Action W), or $p$ has changed
its state to wait. In both cases, in at most one extra round, $p$ executes
Action F.

□ 



Lemma 7. Let $p$ be a processor such that $Backward(p)$ is true at $\gamma_i \vdash A_3$.
Processor $p$ executes Action B in at most one round.

Proof. Let $q$ be the neighbor of $p$ such that $S_p = q$. Then, $S_q = done$ and no
action is enabled on $q$. So, $Backward(p)$ remains true until $p$ executes Action B.

□ 

From the lemmas, theorems, and corollary above, the following directly follows:



Theorem 4 (Liveness). The token always circulates among the processors. 

We now define the following for a configuration $\gamma \vdash A_3$: $A_4 \equiv A_3 \wedge C_{DFTC}$.

Lemma 8. Every computation starting from a configuration $\gamma_i \vdash A_3$ leads to a
configuration in SC in at most $3 \times n$ rounds.

Proof. From Theorem 4, the token always circulates among the processors. For
any processor $p$, while $S_p \in N_p$, no neighbor of $p$ which is done can clean its
state. So, $p$ cannot choose a neighbor as successor twice during a token circulation
cycle. Since Macro $Next_p$ is based on the order $>_p$ on the finite set $N_p$,
every processor $p$ involved in the DFTC cycle is eventually done. So, each edge
$pq$ visited by the token is visited at most twice before reaching a configuration in
SC. So, without loss of generality, $p$ executes Action F first (to send the token
to $q$). Next, $p$ executes Action B (once $q$ is done). Thus, from Lemmas 6 and 7
and Theorem 4 again, after at most $3 \times n$ rounds, the system is in a configuration
in SC. □



Theorem 5 ($true \triangleright A_4$). Algorithm $\mathcal{DFTC}$ satisfies the closure and convergence
properties.

Proof. Follows directly from the preceding theorems, Lemma 8, and the definition of closure and convergence. □









By construction of Algorithm $\mathcal{DFTC}$, the token circulates among the processors
following the first DFS tree. So, from the definition of self-stabilization, the
construction of Algorithm $\mathcal{DFTC}$, and Theorem 5, we obtain:

Theorem 6 (Self-Stabilization). Protocol $\mathcal{DFTC}$ stabilizes for Specification
$\mathcal{SP}_{DFTC}$.

5 Complexities of Algorithm $\mathcal{DFTC}$

Each processor $p \neq R$ needs $(n - 1) \times (\Delta_p + 3)$ states ($n - 1$ states for Variable $L_p$,
$\Delta_p + 3$ states for Variable $S_p$), where $\Delta_p$ is the degree of $p$.

From Theorem 1, the root has no predecessor after at most one round. From
Theorem 2, the network contains no worm after at most $L_{max}$ rounds. From
Theorem 3, the network contains no abnormal waiting processor after at most one round.
Note that the fact that no neighbor of the root has the root as a successor is
independent of the removal of the worms. So, the two proceed in parallel, i.e., the
first round of the $L_{max}$ rounds needed to remove the worms is the same round in which
the predecessors of the root correct their state. Conversely, the round removing
abnormal waiting processors can occur only after the $L_{max}$ rounds spent removing
the worms. This is because the removal of a worm can create
an abnormal waiting processor (see the proof of Theorem 2). In the worst case,
when the system contains only one token and no abnormal waiting processor,
the number of rounds required to reach a configuration in SC is the time to
complete a DFTC cycle, i.e., $3 \times n$ (Lemma 8). Hence, by adding the number
of rounds required to remove all abnormal configurations ($n + 1$ rounds) to the
time to complete a DFTC cycle, we get the following result:

Theorem 7. Algorithm $\mathcal{DFTC}$ requires $(n - 1) \times (\Delta_p + 3)$ states per processor.
In the worst case, the stabilization time of Algorithm $\mathcal{DFTC}$ is $(4 \times n) + 1$
rounds.

Note that Theorem 7 shows the stabilization time for the DFTC problem.
However, if Algorithm $\mathcal{DFTC}$ is used to implement mutual exclusion in
an arbitrary graph, then, from Corollary 3, the mutual exclusion property is achieved
as soon as no worm exists, i.e., in at most $n - 1$ rounds.
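
As a worked check of the bounds stated in this section, the following minimal sketch evaluates the per-processor state count and the worst-case round counts for given values of n and of the degree of p. The function names are illustrative only and do not come from the paper.

```python
def states_per_processor(n: int, degree: int) -> int:
    """(n - 1) values for variable L_p times (degree + 3) values for variable S_p."""
    return (n - 1) * (degree + 3)


def stabilization_rounds(n: int) -> int:
    """(n + 1) rounds to remove abnormal configurations plus 3 * n for one DFTC cycle."""
    return (n + 1) + 3 * n


def mutual_exclusion_rounds(n: int) -> int:
    """Rounds after which a single token remains (no worm exists), as stated in the text."""
    return n - 1


if __name__ == "__main__":
    n, degree = 10, 4
    print(states_per_processor(n, degree))  # (10 - 1) * (4 + 3)
    print(stabilization_rounds(n))          # (4 * 10) + 1
    print(mutual_exclusion_rounds(n))
```

For n = 10 the sum (n + 1) + 3n matches the closed form (4 × n) + 1 of Theorem 7.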

6 Conclusion 

We presented a simple self-stabilizing DFTC algorithm for general rooted net- 
works. The proposed algorithm stabilizes faster than any algorithm for general 
networks in the current literature. The stabilization time is $O(n)$. So, the stabilization
time is similar to the time for the token to complete one DFTC cycle.

Abstract. A traversal algorithm is a systematic procedure for exploring
a graph by examining all of its vertices and edges. A traversal is
Eulerian if every edge is examined exactly once. We present a simple
deterministic distributed algorithm for the Eulerian traversal problem that
is space-optimal: each node has exactly $d$ states, where $d$ is the outgoing
degree of the node, yet may require $O(m^2)$ message exchanges before it
performs an Eulerian traversal, where $m$ is the total number of edges in
the network. In addition, our solution has failure tolerance properties: (i)
messages that are exchanged may have their contents corrupted during
the execution of the algorithm, and (ii) the initial state of the nodes may
be arbitrary.

Then we discuss applications of this algorithm in the context of
self-stabilizing virtual circuit construction and cut-through routing.
Self-stabilization guarantees that a system eventually satisfies its
specification, regardless of the initial configuration of the system. In the
cut-through routing scheme, a message must be forwarded by intermediate
nodes before it has been received in its entirety. We propose a
transformation of our algorithm by means of randomization so that the resulting
protocol is self-stabilizing for the virtual circuit construction
specification. Unlike several previous self-stabilizing virtual circuit construction
algorithms, our approach has a small memory footprint, does not require
central preprocessing or identifiers, and is compatible with cut-through
routing.



1 Introduction 

Traversal. A traversal algorithm is a systematic procedure for exploring a graph
by examining all of its vertices and edges. Traversal algorithms are typically used
to explore unknown graphs and build a map as the graph is visited. The
case of Eulerian connected directed graphs (where every node has as many
incoming edges as outgoing edges) offers the best performance, since a traversal may
be performed by visiting each link exactly once (it is then an Eulerian traversal).
A centralized algorithm has been proposed that traverses an Eulerian graph
by visiting at most $2m$ edges, where $m$ is the overall number of directed edges;
once the graph map has been built, one can easily use well-known centralized
algorithms to compute an Eulerian cycle (a cycle that includes all edges exactly
once), which in turn can be used to perform an Eulerian traversal. A
distributed solution to the Eulerian cycle construction has also been given; it requires
$2m$ message exchanges and $\Omega(d^2)$ memory states at each node, where $d$ is the
outgoing degree of the node.



Self-Stabilization. Robustness is one of the most important requirements of modern
distributed systems. Various types of faults are likely to occur at various
parts of the system. These systems experience transient faults because they
are exposed to constant changes in their environment. One of the most inclusive
approaches to fault tolerance in distributed systems is self-stabilization.
Introduced by Dijkstra, this technique guarantees that, regardless of the
initial state, the system will eventually converge to the intended behavior or the
set of legitimate states. Since most self-stabilizing fault-tolerant protocols are
non-terminating, if the distributed system is subject to transient faults corrupting
the internal node state but not its behavior, then once faults cease, the protocols
themselves guarantee recovery in finite time to a safe state without the need
for human intervention. This also means that the complicated task of initializing
distributed systems is no longer needed, since self-stabilizing protocols regain
correct behavior regardless of the initial state. Furthermore, note that in practice,
the context in which we may apply self-stabilizing algorithms is fairly broad,
since the program code can be stored in stable storage at each node so that
it is always possible to reload the program after faults cease or after every fault
detection.



Cut-Through Routing. Cut-through routing is used in many ring networks
(including IBM Token Ring and FDDI). In this routing scheme, a node can start 
forwarding any portion of a message to the next node on the message’s path be- 
fore receiving the message in its entirety. If this message is the only traffic on the 
path, the total delay incurred by the message is bounded by the transmission 
time (calculated on the slowest link on the path) plus the propagation delay. So, 
the total message delay is proportional to the length of the message and to the 
number of links on the path. Some pieces of the same message may simultane- 
ously be traveling on different links and some other pieces are stored at different 
nodes. As the first bit of the message is transmitted on the links on the message’s 
routing path, the corresponding links are reserved, and the reservation of a link 
is released when the last bit of the message is transmitted on the link. 
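
The delay bound stated above can be illustrated with a short sketch. It assumes a single message on an otherwise idle path, ignores per-node buffering latency, and adds a store-and-forward estimate purely for contrast; that comparison and all function and parameter names are assumptions of this sketch, not part of the paper.

```python
from typing import Sequence


def cut_through_delay_bound(message_bits: float,
                            link_rates_bps: Sequence[float],
                            propagation_delays_s: Sequence[float]) -> float:
    """Transmission time on the slowest link plus the total propagation delay."""
    transmission = message_bits / min(link_rates_bps)
    return transmission + sum(propagation_delays_s)


def store_and_forward_delay(message_bits: float,
                            link_rates_bps: Sequence[float],
                            propagation_delays_s: Sequence[float]) -> float:
    """For contrast only: every node receives the whole message before forwarding it."""
    transmission = sum(message_bits / rate for rate in link_rates_bps)
    return transmission + sum(propagation_delays_s)


if __name__ == "__main__":
    rates = [1e6, 2e6, 1e6]          # bits per second on each link of the path
    props = [0.001, 0.001, 0.001]    # seconds of propagation delay per link
    print(cut_through_delay_bound(8000, rates, props))
    print(store_and_forward_delay(8000, rates, props))
```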

This approach removes the need for any node to have a local memory larger
than that required to store a bounded number of bits, and also reduces the
message delay to a small value (bounded by the buffer size of the node). Since,
with current processors, the time needed for sending/receiving bits to/from
a communication medium is far greater than the time needed to perform the basic
computational steps (such as integer calculations, tests, read/write from/to
registers, etc.), we can assume that a given process can perform a limited number
of steps between the receipt of two pieces of a message.

Our Contribution. First, we present a distributed algorithm that is state optimal
with respect to the Eulerian traversal problem. At every node, exactly $d$ memory
states are needed, where $d$ is the actual outgoing degree of the node. In addition,
our algorithm makes very few assumptions about the system on which it
is run. For example, nodes need not have unique identifiers or a special distinguished
leader. Node variables need not be properly initialized when the protocol
is started. Moreover, our protocol keeps behaving according to the Eulerian
traversal specification even when messages that are exchanged between nodes
have their content arbitrarily corrupted from time to time, even during the execution
of the algorithm. Still, when it is first started, our algorithm may exhibit
a transient $O(m^2)$ time period (where $m$ is the overall number of edges) during
which it performs the first traversal of the network. That first traversal may not
be Eulerian, but every subsequent traversal is and remains Eulerian.

Second, prompted by the failure tolerance properties of our algorithm (message
contents corruption, node memory initial corruption), we transform it into a
self-stabilizing virtual circuit construction algorithm by means of randomization.
Informally, there exist two kinds of self-stabilizing virtual circuit constructions
in the literature. Some assume bidirectional networks (which
are a proper subset of Eulerian networks), construct a spanning tree in a
self-stabilizing way, and then perform an Eulerian tour of this tree (which is trivially
done). Others assume only strongly connected networks (which are
a proper superset of Eulerian networks), but require either some central
preprocessing or unique node identifiers, have a high memory footprint, and are not
compatible with cut-through routing. In comparison, our virtual circuit construction
only works in directed Eulerian networks, yet does not require central
preprocessing or identifiers, has a small memory footprint, and is compatible with
cut-through routing. Moreover, when used as a lower layer by some other algorithm
through a composition scheme, the so-constructed virtual ring
permits a bijective mapping between nodes in the original Eulerian network and
nodes in the virtual ring network. This bijection allows the upper layer algorithm
to remain unchanged. Then, previously known self-stabilizing cut-through
algorithms that perform on unidirectional rings can now be run
on Eulerian networks without any change in their code.

Overview. In Section 2, we present the system model and definitions that will
be used throughout the paper. In Section 3, a distributed Eulerian traversal
algorithm is presented, along with associated correctness and complexity results.
Applications to self-stabilization and cut-through routing are given in
Section 4. Concluding remarks are provided in the final section.



2 Model 

A processor is a sequential deterministic machine that uses a local memory, a
local algorithm and input/output capabilities. Such a processor executes its local
algorithm, which modifies the state of the processor memory, and sends/receives
messages using the communication ports. A unidirectional communication link
transmits messages from a processor $o$ (for origin) to a processor $d$ (for destination).
The link interacts with one input port of $d$ and one output port of $o$.
We assume that links do not lose, reorder or duplicate messages.

A distributed system is a 2-tuple $S = (P, L)$ where $P$ is the set of processors
and $L$ is the set of communication links. A distributed system is represented by a
directed graph whose nodes denote processors and whose directed edges denote
communication links. The state of a processor can be reduced to the state of its
local memory, and the state of a communication link can be reduced to its contents;
the global system state, called a configuration, is then the product of the states
of the memories of the processors of $P$ and of the contents of the communication
links in $L$. The set of configurations is denoted by $\mathcal{C}$.

Our system is not fixed: it passes from one configuration to another when a
processor executes an instruction of its local algorithm or when a communication
link delivers a message to its destination. The resulting sequence of configurations
is called a computation, and is a maximal alternating sequence of configurations
of S and actions. A computation is denoted by $C_1, a_1, C_2, a_2, \ldots$ and is such that
for any positive integer $i$, the transition from $C_i$ to $C_{i+1}$ is done through the
execution of action $a_i$. A projection of a computation $C_1, a_1, C_2, a_2, \ldots$ on some set of
actions $A$ is the sequence of actions $a_{i_1}, a_{i_2}, \ldots$ such that each action $a_{i_j}$
is in $A$. A finite subsequence of a computation or projection is called a factor.
Configuration $C_1$ is called the initial configuration of the computation. In the
most general case, a problem is specified by enumerating the computations
that satisfy it. Formally, a specification is a set of computations. A
computation $E$ satisfies a specification $\mathcal{A}$ if it belongs to $\mathcal{A}$.
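
The notions of projection and factor can be made concrete with a small sketch over finite prefixes of a computation. Actions are represented here by arbitrary hashable labels, and a factor is read as a contiguous run of the sequence (the way factors are used in the proofs of Section 3.2); both choices, and the function names, are assumptions of this sketch.

```python
from typing import Hashable, Iterable, List, Sequence, Set


def projection(actions: Iterable[Hashable], selected: Set[Hashable]) -> List[Hashable]:
    """Keep, in order, only the actions that belong to the chosen set."""
    return [a for a in actions if a in selected]


def is_factor(candidate: Sequence[Hashable], sequence: Sequence[Hashable]) -> bool:
    """True if `candidate` occurs as a contiguous run inside `sequence`."""
    k = len(candidate)
    return any(list(sequence[i:i + k]) == list(candidate)
               for i in range(len(sequence) - k + 1))


if __name__ == "__main__":
    # Action labels of a finite prefix of a computation (hypothetical labels).
    actions = ["send:1", "recv:1", "send:2", "recv:2", "send:1"]
    sends = projection(actions, {"send:1", "send:2"})
    print(sends)
    print(is_factor(["send:2", "send:1"], sends))
```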

A self-stabilizing algorithm does not always satisfy its specification. However,
it seeks to reach a configuration from which any computation will satisfy
its specification. A set of configurations $B \subseteq \mathcal{C}$ is closed if for any $b \in B$, any
possible computation of system S whose initial configuration is $b$ only contains
configurations in $B$. A set of configurations $B_2 \subseteq \mathcal{C}$ is an attractor for a set of
configurations $B_1 \subseteq \mathcal{C}$ if for any $b \in B_1$ and any possible computation of S whose
initial configuration is $b$, the computation contains a configuration of $B_2$. Then
a system S is self-stabilizing for a specification $\mathcal{A}$ if there exists a non-empty set
of configurations $\mathcal{L} \subseteq \mathcal{C}$ such that (closure) any computation of S whose initial
configuration is in $\mathcal{L}$ satisfies $\mathcal{A}$ and (convergence) $\mathcal{L}$ is an attractor for $\mathcal{C}$.
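
For a finite system, the closure and attractor conditions can be checked mechanically. The sketch below assumes the system is given as an explicit transition relation (a dictionary from each configuration to its possible successors) and makes no fairness assumption; this representation and the function names are illustrative and not taken from the paper.

```python
from typing import Dict, Hashable, Set

Transitions = Dict[Hashable, Set[Hashable]]  # configuration -> possible next configurations


def is_closed(b: Set[Hashable], step: Transitions) -> bool:
    """Closure: no transition leaves the set b."""
    return all(nxt in b for c in b for nxt in step.get(c, set()))


def is_attractor(b2: Set[Hashable], b1: Set[Hashable], step: Transitions) -> bool:
    """Convergence: every maximal computation starting in b1 contains a configuration of b2."""
    proven: Set[Hashable] = set()    # configurations known to always reach b2
    on_stack: Set[Hashable] = set()  # configurations on the current search path

    def always_reaches(c: Hashable) -> bool:
        if c in b2 or c in proven:
            return True
        if c in on_stack:            # a cycle avoiding b2: some infinite computation misses it
            return False
        successors = step.get(c, set())
        if not successors:           # the computation terminates outside b2
            return False
        on_stack.add(c)
        ok = all(always_reaches(nxt) for nxt in successors)
        on_stack.discard(c)
        if ok:
            proven.add(c)
        return ok

    return all(always_reaches(b) for b in b1)


if __name__ == "__main__":
    step = {0: {1}, 1: {2}, 2: {2}, 3: {3}}
    print(is_closed({2}, step), is_attractor({2}, {0, 1, 2}, step))
```

With these two checks, a candidate legitimate set is accepted when it is closed and is an attractor for the whole configuration set.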

In this paper, we also use a weaker requirement than self-stabilization that
we call node-stabilization. Informally, a system is node-stabilizing if it reaches
a correct behavior independently of the initial state of the nodes, yet one may
assume that the state of the communication links satisfies some global predicate.
Then a system S is node-stabilizing for a specification $\mathcal{A}$ if there exist two
non-empty sets of configurations $\mathcal{L} \subseteq \mathcal{C}$ and $\mathcal{N} \subseteq \mathcal{C}$ such that (closure) any
computation of S whose initial configuration is in $\mathcal{L}$ satisfies $\mathcal{A}$, (convergence)
$\mathcal{L}$ is an attractor for $\mathcal{N}$, and (node independence) all possible node states are
in $\mathcal{N}$.

3 State-Optimal Distributed Eulerian Traversal 

In this section, we present a distributed algorithm that stabilizes to an Eulerian 
traversal provided that it is executed starting from a configuration where a sin- 
gle message is present (either at a node or within a communication link) . In the 
following, we call such a configuration a singular configuration. 

While the time complexity of this algorithm is not optimal starting
from the worst possible initial configuration (an $O(m)$ distributed
algorithm is known, where $m$ is the overall number of edges of the network),
it does have nice static (it is state optimal at every node) and dynamic (it is
node-stabilizing) properties.



3.1 The Algorithm 

Let $\delta^-(i)$ and $\delta^+(i)$ denote respectively the incoming and outgoing
degree of node $i$. As the communication graph is Eulerian, let $d_i = \delta^-(i) = \delta^+(i)$.
Moreover, each processor $P_i$ has a $Path_i$ variable that takes values between 0
and $d_i - 1$. All operations on this variable are done modulo $d_i$. Algorithm 3.1,
which is executed upon receipt of a message $m$, is the same for all processors in
the system.

In this section, we assume that the communication between nodes is achieved
through asynchronous message passing. In Section 4, we discuss the possibility
of using Algorithm 3.1 in a synchronous or semi-synchronous system.



Algorithm 3.1 Distributed Eulerian traversal algorithm at node i
    Send m using the outgoing link whose index is $Path_i$
    $Path_i \leftarrow Path_i + 1$
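
The two-line rule above can be sketched as a small class. The class name, the link labels, and the `on_receive` method are illustrative choices for this sketch, not identifiers from the paper; the increment is taken modulo the out-degree as stated in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    out_links: List[str]  # labels of the outgoing links, indexed 0 .. d_i - 1
    path: int = 0         # Path_i; any value in 0 .. d_i - 1 is an acceptable initial state

    def on_receive(self, message: object) -> str:
        """Forward `message` on the link designated by Path_i, then advance Path_i."""
        link = self.out_links[self.path]
        self.path = (self.path + 1) % len(self.out_links)
        return link  # label of the link on which the message is sent


if __name__ == "__main__":
    b = Node(out_links=["b0", "b1"], path=1)
    print([b.on_receive("m") for _ in range(3)])
```

The message content is never inspected, which is consistent with the claim that the algorithm tolerates corruption of message contents.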



Example of Computation. Figure 1(a) presents a distributed system whose
communication graph is Eulerian: processors A, C, and F each have one incoming
link and one outgoing link; processors B, D and E each have two incoming links
and two outgoing links. The $Path_i$ variable of each processor $P_i$ is denoted by
an arrow that points to the outgoing link on which the next message will be sent.
For example, processor A has just sent message $m$ and processor B (which
is about to receive $m$) will retransmit it through its outgoing link $b_1$. We now
follow the path of message $m$ from "initiator" A (actually the latest processor
that transmitted $m$). Given the initial configuration of the $Path_i$ variables, $m$ will go
through links $a_0$, $b_1$, $d_0$ and $c_0$ before it returns to processor A. The followed
path is obviously not Eulerian, since links $b_0$, $d_1$, $e_0$, $e_1$ and $f_0$ have not been
visited by $m$. Nevertheless, variables $Path_B$ and $Path_D$ have changed their values
during this round of message $m$: $Path_B$ now points to $b_0$ and $Path_D$ to $d_1$.

Figure 1(b) presents the same distributed system as Figure 1(a), but at the
second round of message $m$. Given the configuration of the $Path_i$ variables, message
$m$ follows links $a_0$, $b_0$, $e_0$, $d_1$, $b_1$, $d_0$ then $c_0$ before returning to processor
A. Again, the followed path is not Eulerian, since links $e_1$ and $f_0$ have not been
followed by $m$. Yet, the $Path_E$ variable has changed value during this second
round of $m$: it now points to $e_1$. In contrast, variables $Path_B$ and $Path_D$ are
back to the values they had at the beginning of the second round (i.e. $b_0$ and $d_1$,
respectively).

Figure 1(c) presents the same system at the beginning of the third round. Given
the configuration of the $Path_i$ variables, the message will follow links $a_0$, $b_0$, $e_1$, $f_0$,
$e_0$, $d_1$, $b_1$, $d_0$ then $c_0$ before returning to processor A. The followed path is
Eulerian, since every link is traversed exactly once and message $m$ is back
at the "initiator" A. Moreover, the $Path_i$ variables hold the same values as at the
beginning of the third round (see Figure 1(d)), which means that the fourth round
will be identical to the third. Consequently, message $m$ will follow the very same
Eulerian path infinitely often.

Fig. 1. Example of computation of Algorithm 3.1: (a) the system at the beginning of
the first round; (b) at the beginning of the second round; (c) at the beginning of the
third round; (d) at the beginning of the fourth round.
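
The example above can be replayed in a few lines. Since the figure itself is not reproduced here, the endpoints of each link are inferred from the three paths listed in the text, and the initial values of the Path variables are those implied by the first round; both are reconstructions made for this sketch.

```python
from typing import Dict, List

# link label -> (origin, destination), inferred from the paths described in the example
LINKS = {
    "a0": ("A", "B"), "b0": ("B", "E"), "b1": ("B", "D"),
    "c0": ("C", "A"), "d0": ("D", "C"), "d1": ("D", "B"),
    "e0": ("E", "D"), "e1": ("E", "F"), "f0": ("F", "E"),
}
OUT = {"A": ["a0"], "B": ["b0", "b1"], "C": ["c0"],
       "D": ["d0", "d1"], "E": ["e0", "e1"], "F": ["f0"]}


def run_round(path: Dict[str, int], start: str = "A") -> List[str]:
    """Forward the message from `start` until it returns there; `path` is updated in place."""
    visited: List[str] = []
    node = start
    while True:
        link = OUT[node][path[node]]
        path[node] = (path[node] + 1) % len(OUT[node])  # Path_i <- Path_i + 1 (mod d_i)
        visited.append(link)
        node = LINKS[link][1]
        if node == start:
            return visited


# Initial Path values: B points to b1, every other processor points to its link 0.
path = {"A": 0, "B": 1, "C": 0, "D": 0, "E": 0, "F": 0}
rounds = [run_round(path) for _ in range(4)]

if __name__ == "__main__":
    for number, links in enumerate(rounds, start=1):
        print(number, " ".join(links))
```

Under these assumptions the first two rounds visit 4 and 7 links respectively, and the third and fourth rounds both visit all 9 links in the same order, as described in the text.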

3.2 Proof of Correctness 

We wish to prove that Algorithm 3.1 is node-stabilizing for the Eulerian traversal
problem. Each of the following lemmas assumes that the algorithm is started
from a singular configuration as defined below:

Definition 1. A configuration C is singular if it contains exactly one message
(either traversing a node or a communication link).

The liveness lemma (Lemma 1) shows that Send actions through a particular
link appear infinitely often in any computation, and this for any particular link.
The following lemmas make use of it and consider computation factors that begin
and end with the same Send action. The uniqueness lemma (Lemma 2) shows
that between any two successive Send actions on the same link, no other link
may be related to more than one Send action. The completeness lemma (Lemma 3)
shows that after every link is related to at least one Send action, between any
two successive Send actions on a particular link, every other link is related to
exactly one Send action. Finally, the legitimacy lemma (Lemma 4) shows that
these Send actions always appear in the same order, and thus that the message
performs an Eulerian traversal forever.

Lemma 1. Starting from a singular configuration, every link is visited by the 
message infinitely often. 

Proof. Suppose there exists a link $c_{i \to j}$ (allowing $P_i$ to send messages to $P_j$)
that is not visited infinitely often starting from a singular configuration. From
Algorithm 3.1, if processor $P_i$ executes Send actions infinitely often, it does so
on every outgoing link. Then, if $P_i$ did not execute a Send action infinitely often
on link $c_{i \to j}$, then $P_i$ has executed Send actions only a finite number of times in
the whole computation, and thus $P_i$ received the message only a finite number
of times. Every incoming link of $P_i$ is then in the same case as $c_{i \to j}$, and has not
been visited infinitely often. Applying the same reasoning again, and since the
network is finite and strongly connected, no link has been visited infinitely often.
This contradicts the fact that Algorithm 3.1 cannot deadlock starting from a
singular configuration (since every receipt of a message implies an immediate
Send action).



Notation 1. For the sake of simplicity in the proof of the following lemmas,
we arbitrarily number links from 1 to $L$ (the number of links in the system) and
denote by $l_j$ a Send action through link number $j$. Thus $l_j$ and $l_p$ denote Send
actions on different links if $j \neq p$ and on the same link if $j = p$.



Lemma 2. Starting from a singular configuration, between any two Send ac- 
tions on the same link, no other link is associated with a Send action twice. 

Proof. Let us consider a particular computation $e$ of Algorithm 3.1 and its
projection $p$ on Send actions. From Lemma 1, action $l_1$ appears infinitely often in
$p$. Let us study a factor $f$ of $p$ that starts and ends with $l_1$ and such that $f$ does
not contain any other $l_1$.

We do not discuss the trivial case $f = l_1 l_1$, where no Send action other than
on the unique link is possible (in this case, the Eulerian traversal is trivially
satisfied). Now assume that between the two $l_1$ actions, $f$ contains some action
$l_k$ twice: $f = l_1 l_2 \ldots l_k \ldots l_n l_k \ldots l_1$. In more detail, while the $n$ first actions of
$f$ are pairwise distinct, the second $l_k$ is the first action that appears twice in $f$.
We will show that the existence of this second $l_k$ leads to a contradiction.

Action $l_k$ is a Send action on link $c_{i \to j}$. If $P_i$ performed twice a Send action
involving $c_{i \to j}$ (the two occurrences of $l_k$), then $P_i$ received the message $\delta^+(i) + 1$
times. In turn, if $P_i$ received the message $\delta^+(i) + 1$ times, and since the graph is
Eulerian, $P_i$ received it $\delta^-(i) + 1$ times and thus twice from the same incoming
link. Then it follows that two Send actions occurred on this incoming link. In
the writing of $f = l_1 l_2 \ldots l_k \ldots l_n l_k \ldots l_1$, this means that some two actions
between $l_1$ and $l_n$ are identical, while our hypothesis claims that they are pairwise
distinct. Therefore every factor $f$ of $p$ of length $n + 1$ that starts and ends with
$l_1$ can be written as $f = l_1 \ldots l_n l_1$, with $l_1, \ldots, l_n$ pairwise distinct.

Lemma 3. Starting from a singular configuration, and after every link has been 
visited by the message, between any two Send actions on the same link, every 
other link is associated with a Send action exactly once. 

Proof. Let us consider a computation of the algorithm. From Lemma 1, this
computation contains each Send action infinitely often. It is then possible to
write its projection $p$ on Send actions as $t_0 l_1 t_1 l_1 t_2 \ldots l_1 t_n l_1 \ldots$, where $t_0$ contains
at least once each of the $l_k$, $k \in \{1, \ldots, L\}$, and where none of the $t_{i \geq 1}$ contains $l_1$. From
Lemma 2, it is impossible that any of the $t_{i \geq 1}$ contains the same Send action
twice. Therefore, all factors $t_{i \geq 1}$ are of length at most $L - 1$.

Suppose now that the factor $t_j$ ($j \geq 1$) is of length strictly lower than $L - 1$,
and let us denote by $l_p$ ($p \neq 1$) the Send action that does not appear in $t_j$.
Lemma 1 ensures that $l_p$ appears infinitely often in the computation. Thus there
exists a smallest $k > j$ such that $l_p$ is a Send action of factor $t_k$. Moreover, since
$l_p$ appears by definition in $t_0$, there also exists a greatest $m < j$ such that $l_p$ is
a Send action of factor $t_m$.

Consequently, the projection $p$ has a factor $t_m l_1 t_{m+1} \ldots t_j \ldots t_{k-1} l_1 t_k$ where
$l_p$ ($p \neq 1$) does not appear in any of the $t_i$, $i \in \{m+1, \ldots, k-1\}$, but appears in $t_m$ and
in $t_k$:

$$\ldots l_p \ldots l_1 t_{m+1} \ldots t_j \ldots t_{k-1} l_1 \ldots l_p \ldots$$

The Send action $l_1$ then appears twice between two successive Send actions
$l_p$, which contradicts Lemma 2.

In conclusion, in the projection $t_0 l_1 t_1 l_1 t_2 \ldots l_1 t_n l_1 \ldots$, every $t_{i \geq 1}$ contains
only different Send actions and is of size $L - 1$. In other terms, after every
link has been visited by the message (i.e. after $t_0$), between any two Send actions
on the same link $l_1$, every other link is associated with a Send action exactly once
(every $t_{i \geq 1}$ contains each $l_k$, $k \in \{2, \ldots, L\}$, exactly once).

Lemma 4. Starting from a singular configuration, and after every link has been 
visited by the message, between any two Send actions on the same link, every 
other link is associated with a Send action exactly once and in the same order. 

Proof. Let us consider a computation of the algorithm. From Lemma 1, this
computation contains each Send action infinitely often. It is then possible to
write its projection $p$ on Send actions as $t_0 l_1 t_1 l_1 t_2 \ldots l_1 t_n l_1 \ldots$, where $t_0$ contains
at least once each of the $l_k$, $k \in \{1, \ldots, L\}$, and where (from Lemma 3) every $t_{i \geq 1}$
contains exactly once each of the $l_k$, $k \in \{2, \ldots, L\}$. In more detail, for $t_j = l_2 l_3 \ldots l_L$ we
can write $t_{j+1}$ as $l_{\sigma(2)} l_{\sigma(3)} \ldots l_{\sigma(L)}$, where $\sigma$ is a permutation. Assume that there
exists a smallest integer $q_1$ in $\{2, \ldots, L\}$ such that $\sigma(q_1) \neq q_1$. Then there
exists an integer $q_2$ ($q_1 < q_2 \leq L$) such that $\sigma(q_1) = q_2$.
We are now able to rewrite a factor of the projection $p$ as:

$$\underbrace{l_2 \cdots l_{q_1} \cdots l_{q_2 - 1}\, l_{q_2} \cdots l_L}_{t_j}\; l_1\; \underbrace{l_2 \cdots l_{q_1 - 1}\, l_{q_2}\, l_{\sigma(q_1 + 1)} \cdots l_{q_1} \cdots l_{\sigma(L)}}_{t_{j+1}}$$

Then, between two $l_{q_1}$, Lemma 2 is contradicted. Indeed, between two successive
occurrences of $l_{q_1}$, we find two occurrences of $l_{q_2}$. This contradiction
proves that every possible permutation $\sigma$ reduces to the identity and that the
projection $p$ can be written as $t_0 (l_1 l_2 \cdots l_L)^\omega$. After every link has been visited by the
message (i.e. after $t_0$), between any two Send actions on the same link $l_1$, every
other link is associated with a Send action exactly once and in the same order
as in $t_1$.



Theorem 1. Starting from a singular configuration, Algorithm 3.1 stabilizes
to an Eulerian traversal.

Proof. In Lemma 4, we proved that the projection $p$ on Send actions of any
computation is of the form $t_0 (l_1 l_2 \cdots l_L)^\omega$, where $t_0$ is finite. Consequently,
an Eulerian traversal through links 1 to $L$ is performed infinitely often after a
finite number of message exchanges.
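
Theorem 1 can be observed empirically on small networks. The sketch below builds a random Eulerian multigraph (a ring plus random directed cycles, so that every node keeps equal in-degree and out-degree), starts from arbitrary Path values with a single message, and reports the index of the first Send from which m consecutive Sends use m distinct links. The generator, the function names, and the simulation budget are assumptions of this sketch; they are not part of the paper.

```python
import random
from typing import Dict, Hashable, List, Optional


def random_eulerian_digraph(n: int, extra_cycles: int,
                            rng: random.Random) -> Dict[int, List[int]]:
    """out[v] lists the destinations of v's outgoing links (parallel links allowed)."""
    out = {v: [(v + 1) % n] for v in range(n)}       # ring: strongly connected
    for _ in range(extra_cycles):
        cycle = rng.sample(range(n), rng.randint(2, n))
        for u, v in zip(cycle, cycle[1:] + cycle[:1]):
            out[u].append(v)                         # a cycle adds one in and one out per node
    return out


def first_eulerian_send(out: Dict[Hashable, List[Hashable]],
                        path: Dict[Hashable, int],
                        start: Hashable, budget: int) -> Optional[int]:
    """Index of the first Send that begins a run of m Sends on m distinct links."""
    m = sum(len(dests) for dests in out.values())
    sends = []
    node = start
    for _ in range(budget):
        k = path[node]
        path[node] = (k + 1) % len(out[node])        # Path_i <- Path_i + 1 (mod d_i)
        sends.append((node, k))                      # a link is identified by (origin, index)
        node = out[node][k]
    for t in range(len(sends) - m + 1):
        # m pairwise distinct Sends use every link once, so every Path_i has advanced
        # d_i times and the same sequence repeats from then on.
        if len(set(sends[t:t + m])) == m:
            return t
    return None


if __name__ == "__main__":
    rng = random.Random(0)
    out = random_eulerian_digraph(6, 3, rng)
    m = sum(len(dests) for dests in out.values())
    path = {v: rng.randrange(len(out[v])) for v in out}
    print(m, first_eulerian_send(out, path, 0, 10 * m * m))
```

The budget of 10 × m² Sends is a generous allowance chosen for this sketch in view of the $O(m^2)$ transient stated in Section 3.3.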



3.3 Complexity 

In order to know the outgoing link to which a message is to be sent, a processor
$P_i$ requires $d_i$ states. Similarly, to know the incoming link by which a message
was received, $P_i$ requires $d_i$ states. We show that a memory of $d_i$ states per processor
is necessary for distributed Eulerian traversal.

In this section, we do not consider the space needed to manage the lower 
layer data link protocol. Our lower bound is for the size of the “routing table” 
that is needed at each node to perform an Eulerian traversal. We also assume 
that messages do not carry meaningful information that could be used to route 
them properly. 

Lemma 5. Every distributed Eulerian traversal algorithm requires di states at 
processor Pi . 

Proof. Suppose that there exists an Eulerian traversal algorithm (deterministic
or probabilistic, stabilizing or non-stabilizing) such that there exists at least one
processor $P_i$ that uses at most $d_i - 1$ states. In an Eulerian network, every
processor $P_i$ has at least one incoming and one outgoing edge. Thus for any $P_i$,
$d_i \geq 1$. If $d_i = 1$, then $P_i$ has fewer than one state, so it may execute no code.
Now, if $d_i \geq 2$, assume that $P_i$ has at most $d_i - 1$ states.

In this last case, there exist at least two incoming links $c_1$ and $c_2$ of $P_i$
by which $P_i$ received the message and such that $P_i$ had to perform a Send
action using the same outgoing link $c_3$ at least twice (in case $P_i$'s algorithm is
deterministic) or at least twice with probability $\epsilon > 0$ (in case $P_i$'s algorithm
is probabilistic). Then the message forwarding scheme is not Eulerian, since at
any point in the computation, it is either certain or possible that messages
arriving on two incoming links are not forwarded to two different outgoing links.
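
The counting argument behind Lemma 5 can be checked exhaustively on a tiny case. The sketch below rests on a simplifying assumption that is not part of the proof: a deterministic node is modeled as a finite automaton whose outgoing link depends only on its current state (messages carry no routing information) and whose state changes by a fixed transition at each Send action. The name `max_distinct_links` is illustrative.

```python
from itertools import product


def max_distinct_links(d, s):
    """Largest number of distinct outgoing links that any deterministic node
    with `s` states and `d` outgoing links can use over `d` consecutive sends.

    Modeling assumption: the link chosen depends only on the current state,
    and the state is updated by a fixed transition at each send.
    """
    best = 0
    for nxt in product(range(s), repeat=s):      # state -> next state
        for out in product(range(d), repeat=s):  # state -> outgoing link
            for state in range(s):               # every possible initial state
                used, cur = set(), state
                for _ in range(d):
                    used.add(out[cur])
                    cur = nxt[cur]
                best = max(best, len(used))
    return best


print(max_distinct_links(3, 2))  # 2: one of the three links is always missed
print(max_distinct_links(3, 3))  # 3: three states suffice (round-robin)
```

With fewer states than outgoing links, the output map cannot reach every link, so no window of $d_i$ consecutive sends can cover all $d_i$ outgoing links.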

A direct corollary of this lemma is the following theorem. 

Theorem 2. Algorithm 3.1 is state optimal.

For the time complexity part, a direct consequence of Lemma Elis the follow- 
ing theorem. 

Theorem 3. Algorithm 3.1 performs its first Eulerian traversal within $O(m^2)$
message exchanges.

4 Applications 

In this section, we investigate applications of Algorithm 3.1 in the context of
self-stabilization. Strictly speaking, Algorithm 3.1 is not self-stabilizing, since
its correct behavior requires that it be started from a singular configuration (a
configuration where a single message is present). However, it does stabilize to
an Eulerian traversal independently of the nodes' initial states and of the
messages' actual contents. Randomization allows Algorithm 3.1 to become
self-stabilizing without the overkill of using a self-stabilizing mutual exclusion
algorithm to guarantee uniqueness of the message. The resulting self-stabilizing
virtual circuit construction algorithm is of particular interest in the context of
cut-through routing, where nodes must retransmit messages before they have been
received entirely. Due to space constraints, all proofs in this section are only 
informally sketched, yet the interested reader may refer to m- 

4.1 Reaching a Singular Configuration 

Since HH showed that message-passing self-stabilizing algorithms require time- 
outs to handle the case where no message is initially present in the network, 
we concentrate here on eliminating superfluous messages in the case where at
least two messages are initially present. Our solution gives different randomized
speeds to messages so that, in an infinite computation, it is possible that two
messages are present at the same node at the same time. When that event occurs,
the node discards every message but one.

Multiple Speeds. If the system is asynchronous (the communication time be- 
tween the origin and the destination of a link may be arbitrary), we assume a 
random distribution on communication time, so that the speeds of the messages 
are actually different. 

If the system is synchronous (the communication time between the origin
and the destination of a link is bounded by 1), we can split every computation
into global steps during which every message is sent and received, and assume that
nodes have access to a random Boolean variable. At each global step, every node that
receives a message consults its random variable: if the random variable returns
true, then the node holds the message for one more global step; otherwise it sends
the message immediately. Note that a node may hold a message for at most one
global step, and that the induced relative speeds of messages are now different
and randomized. This technique has a low memory footprint, since a node only
needs to know whether it should wait one more global step (one bit is sufficient).

Message Decreasing. Now assuming that messages do have different speeds, we
show that Algorithm 3.1 stabilizes to a single-message configuration. Indeed,
starting from an arbitrary configuration with at least two messages, two cases
may arise:

1. At least two messages follow the same circuit that goes through all links in 
the Eulerian graph. By the probabilistic setting, these two messages have 
different speeds and thus starting from any configuration, there is a positive 
probability that they are at the same node, which will discard all messages 
but one, so that the overall number of messages decreases. 

2. At least two messages follow two different circuits. Because of the rotating
exploration performed by Algorithm 3.1, every edge of the system is visited
infinitely often, so these two circuits share a common node $i$. Then, from
the probabilistic speeds of these two messages, starting from any configuration,
there is a positive probability that at some point in the computation,
they are at the same node, which will discard all messages but one. Then
the overall number of messages decreases.

Since at any time there is a positive probability that, in a finite number of
steps, the number of messages decreases if it is strictly greater than 1, then by
the main theorem of [3], after finite time a single message remains in the system
with probability 1.
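
The message-decreasing argument can be illustrated by a small synchronous simulation. This is a sketch under several assumptions that are not taken from the paper: a round-robin forwarding rule stands in for Algorithm 3.1, the three-node graph is hypothetical, a fair coin from Python's `random` module plays the role of the random Boolean variable, and a node that sees several messages during the same global step keeps the first one it finds.

```python
import random

# Hypothetical Eulerian digraph: every node has in-degree 2 and out-degree 2.
OUT = {0: [1, 2], 1: [2, 0], 2: [0, 1]}


def remaining_messages(start_nodes, seed, max_steps=10_000):
    """Simulate global steps until one message is left or `max_steps` elapse.

    Each message is a pair [node, must_send]; `must_send` is the single bit a
    node keeps to remember that it already held the message for one step.
    """
    rng = random.Random(seed)
    path = {v: 0 for v in OUT}  # round-robin pointers (assumed rule)
    messages = [[v, False] for v in sorted(set(start_nodes))]
    for _ in range(max_steps):
        if len(messages) == 1:
            break
        for msg in messages:
            node, must_send = msg
            if not must_send and rng.random() < 0.5:
                msg[1] = True  # hold for one more global step
                continue
            k = path[node]
            path[node] = (k + 1) % len(OUT[node])
            msg[0], msg[1] = OUT[node][k], False  # forward on the current link
        survivors = {}
        for msg in messages:  # a node keeps one message only
            survivors.setdefault(msg[0], msg)
        messages = list(survivors.values())
    return len(messages)


print(remaining_messages([0, 1, 2], seed=1))
```

The `must_send` flag is the one-bit footprint mentioned above: a message is never held for two consecutive global steps. The step bound only guards against non-termination of the loop; the argument itself guarantees convergence with probability 1, not within a fixed number of steps.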



4.2 Towards a Virtual Ring Construction 

Since self-stabilization was first presented by Dijkstra in 1974 (see [5]), who
provided three mutual exclusion algorithms on unidirectional ring networks,
numerous works in self-stabilization were proposed on unidirectional rings (see |1 3)
or 0). Therefore, it is interesting to provide a scheme that allows such
algorithms to run on more general networks, by constructing a virtual ring topology
on top of which the original algorithm is run.

Many self-stabilizing solutions to the virtual ring construction problem exist
for bidirectional networks (which are a proper subset of Eulerian networks);
many of these works first construct a spanning tree of the graph, then perform 
an Eulerian tour of the spanning tree (see 0). In general strongly connected 
networks (which are a proper superset of Eulerian networks), approaches by 
Tchuente ( El) and Alstein et al. ( Q) make use of a central preprocessing 
of the communication graph, or assume that nodes are given unique identifiers. 
Two drawbacks of pun are the high memory consumption and the fact that 
nodes simulate several processes in the virtual ring, which may be incorrect for 
some applications (such as 0). In addition, and to the best of our knowledge,
none of the aforementioned approaches can be used in non-ring networks when
the cut-through routing scheme is used.

Our solution circumvents many of the previously mentioned drawbacks: the
class of Eulerian graphs that we consider is intermediate between the bidirectional
and strongly connected graph classes, we require neither central preprocessing
nor unique node identifiers, memory consumption is low ($O(d)$ at each node,
where $d$ is the outgoing degree of the node), and efficient cut-through routing is
supported.

Cut-Through Routing Compliance. The two main reasons for the cut-through
routing compliance are the following: (i) since the underlying graph is Eulerian,
each node has as many incoming links as outgoing links, so when a message
arrives, it may be forwarded immediately to a free outgoing link, and (ii) since the
message content is unused in the forwarding scheme, no additional processing is
needed before giving control to the composed cut-through algorithm (that could
be any of pilM TTjb 

Virtual Circuit Bijection. In addition, the Eulerian property of the traversal
guarantees that each link is visited exactly once at each traversal. Thus, if a
node has $d$ outgoing links, then the link that is locally labeled 0 at this node is
visited exactly once at each Eulerian traversal, no matter how the local labeling
on outgoing links is performed. Assume now that our Eulerian traversal algorithm
is run to build a virtual circuit that is used by an upper layer application
(such as [2]). If the upper layer application is activated only when a message
arrives and the $Path_i$ variable equals 0, then it is guaranteed that this upper
application is activated exactly once at each Eulerian traversal. This means that
at the upper application level, there is a bijection between the nodes in the
actual system and the nodes in the virtual ring system. This bijection is usually
required for the sake of correctness or service time guarantee.
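
The bijection can be observed on a small example. The sketch below assumes a round-robin forwarding rule in which the $Path_i$ variable holds the local label of the next outgoing link (an assumption standing in for Algorithm 3.1), uses a hypothetical three-node graph, and counts how often the upper layer would be activated at each node during one traversal taken after a warm-up period.

```python
# Hypothetical Eulerian digraph: every node has in-degree 2 and out-degree 2.
OUT = {0: [1, 2], 1: [2, 0], 2: [0, 1]}
M = sum(len(targets) for targets in OUT.values())  # total number of links


def activations_per_traversal(pointers, start, warmup=4 * M * M):
    """Count upper-layer activations per node during one stabilized traversal.

    Assumed rule: the upper layer runs at a node when a message arrives while
    the node's Path variable equals 0; the node then forwards the message on
    link Path and advances Path modulo its outgoing degree.
    """
    path, node = dict(pointers), start
    counts = {v: 0 for v in OUT}
    for step in range(warmup + M):
        k = path[node]
        if step >= warmup and k == 0:
            counts[node] += 1  # upper layer activated here
        path[node] = (k + 1) % len(OUT[node])
        node = OUT[node][k]
    return counts


print(activations_per_traversal({0: 0, 1: 1, 2: 0}, start=0))
```

Although each node of this graph is visited twice per traversal, filtering on the pointer value makes each node appear exactly once in the virtual ring seen by the upper layer.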

5 Concluding Remarks 

We presented a state-optimal distributed solution to the Eulerian traversal
problem. Each node only needs $d$ memory states, where $d$ is the node's outgoing
degree. Our algorithm also presents some failure resilience properties: it is
independent of the message contents and, after $O(m^2)$ message exchanges, it provides
an infinite Eulerian traversal whatever the initial configuration of the nodes may be
(it is node-stabilizing). As such, Algorithm 3.1 proved useful to ensure mutual
exclusion in uniform networks (see 0) and mobile agent traversal for the sake of
self-stabilization (see [S]).

The message content independence was shown to be useful in the context of
cut-through routing, since a node need not know the contents of a message to
properly route it. The insensitivity to node initialization was extended by means of
randomization so that the resulting system is self-stabilizing. This solution
avoids the high memory consumption and preprocessing that were required
by previous approaches.